Google to offer web cache

May 5, 2005

Fron CNet:

A beta, or test version, of Web Accelerator was introduced via the Google Labs technology incubation site late Wednesday. The tool, which must be downloaded, will tap into the power of Google’s global computer network and thus help sites load faster, according to the company.

Web Accelerator works by sending URL requests through company servers designated specifically for speeding site downloads. The application also can compress site data before sending it to computers.


Testing New Anti-Spam Software

April 27, 2005

Based on my sysadmin’s recomendation, I am experimenting with CRM114. CRM114 is a system that needs training to operate well. It requires training on error rather than training on a corpus of prototypes. So, I wanted to have it installed so that all legit mail comes to my inbox and all my spam to a spam folder. While training I will check both folders, saving false positives in a teach-non-spam folder and false negatives in a teach-spam folder. For this to work I needed a program to grab the contents of my teach folders and pass them to CRM114 for training. So I wrote some code that I actually put into production! If it works, I will put into into production for the other people on this server. It works for anyone using an IMAP server with access to shell. Now I have cron run it every 5 minutes and I am now training. Yay. The source is here. Feel free to use it and tell me what you think.


HAppS 0.2 Released!

February 22, 2005

I’ve been working on an application server in Haskell on and off for a while. Here is the announcement I just posted to the Haskell mailing lists:

HAppS is a Haskell library for building Internet applications, featuring:

  • HAppS.ACID: Guarantee application integrity in the face of unplanned outages using this module’s integrated write-ahead logging and checkpointing framework.
  • HAppS.DBMS: Do relational operations in Haskell (rather than SQL) on Haskell sets (outside the IO Monad). Define custom indices for your Haskell datatypes (e.g. geographic/geometric types). Use in combination with ACID for a robust relational DBMS customized for your application.
  • HAppS.Protocols: Expose your application using as an HTTP server and/or by recieving and sending SMTP.

License
HAppS is released by HAppS.org under the terms of the GNU General Public License Version 2, a copy of which is enclosed with this package.
How do I get HAppS?
More information is available at http://HAppS.org
A tarball of HAppS is available at http://HAppS.org/HAppS/happs.tar.gz. HAppS is also available via darcs:

darcs get http://happs.org/HAppS

Another reason to build in Haskell

January 13, 2005

Someone built a web server in Erlang. It scales really well on multiprocessor architectures because of its threading model. I believe that Haskell has a similar threading model and therefore could achieve similar performance.

Via John Udell, Patrick Logan makes this point in the contect of pointing to this register article is saying that multi-core computing is the future and this bodes well for the sorts of threading models found in Erlang and Haskell:

A single Erlang node on a single CPU today can comfortably get into the tens of thousands of dynamic processes. What would your system look like running hundreds of thousands or a million dynamic processes and lot of activity is spent collaborating with other systems also running at that scale?


Thinking about a better open source license

November 16, 2004

The GPL and the MPL combine with Moore’s law and the Internet to allow people to take source private by operating it as a service. I am planning to release an app server open source and want to remedy this problem. Here are the key terms. I’m looking for comments:

Simple Source License, Version 0.1
Copyright (c) 2004 HAppS.org. All rights reserved.

Preamble

The copyright holder of the covered work has released it without charge to the developer community in the hopes that it will be of value and in the hopes that other developers will in turn make and distribute improvements back and in turn get the same value.

This license reflects these hopes by permitting anyone to use, modify, bundle, or provide a service derived from covered work in whatever manner they choose and for whatever they wish to charge as long as they make the associated source freely available as well.

Terms and Conditions
1. Scope of License
This license covers any work that contains notice placed by the copyright holder stating that fact.

2. Covered Works
You may reproduce, modify, display, and/or distribute verbatim copies of covered works.

3. Modified Versions Of Covered Works
You may reproduce, modify, display, and/or distribute modified versions of covered works or patches to original or modified versions of covered works
a. only if they retain the original’s copyright, license, disclaimer, and all notices, if any, thereof,
b. only if they include prominent notice that they derive directly or indirectly from covered work and are NOT verbatim copies thereof, and
c. only if they include some way to identify the covered work from which they directly or indirectly derive.

4. Derived Works and Derived Services
A derived work is any work that results from modification or transformation of derived or covered works. A derived service is any service that depends in part or whole on the operation of derived work. Associated source is any
source used to create a derived work or provide a derived service. You may distribute derived works and/or provide derived services and/or permit others to do the same only if for the following year you make all associated source
available to the recipients of such works and/or users of such services
a. at no charge,
b. at a URL displayed prominently and designated as such in the acknowledgment section of the associated end-user documentation, and
c. under terms conformant with Open Source Definition 1.9 or later as available from the Open Source Initiative or under this license,
d. and only if the license or terms of use conspicuously advises the recipient and/or end user of their rights to the source under the terms of this license.

5. Acknowledgement
Any derived work or service must have end-user documentation with anacknowledgment section containing the following text:
“This product includes software developed by [copyright holder's name] ([copyright holder's URL]).” (replacing bracketed text with its value).


How Much Does it Cost to Serve EBay

June 22, 2004

According to Carlos Perez eBay serves ~400M page views per day on a database of 60 million auctions with 30k lines of code changes per week with 99.92% uptime. According to eBay’s 10-k, it has 41.2M active users and is growing at 50% per year! The 10-k is ambiguous but it looks like eBay depreciated at least $21M of equipment last year. I think eBay is paying WAYYY too much for server ops.

A modern CPU can handle ~10k HTTP requests per second (see these benchmarks. I don’t know how much memory eBay’s database consumes, but we can make some estimates. Lets assume each auction and each user requires 1k of memory. 60M auctions plus 40M users implies 100GB of memory. Now lets double that for indexing and other overhead and we are at 200GB.

Today you can buy a Sun E25K for $3.6M with 72 processors and up to 500GB of memory. That seems like enough to handle eBay’s load. Potential problems

  • Need to save to disk: Use Prevayler style architecture (Note there are major problems with Prevayler itself, but I am describing a style).
  • Memory access bottleneck: Someone with more hardware knowledge, please enlighten me.
  • Failover: Ok, buy two E25Ks. Have the second one always recovering.
  • Code Changes: This one strikes me as more difficult. Getting adequate web server performance requires compilation. You need a development infrastructure that allows dynamic modification of code.

How To Run An Organization

June 15, 2004

I was talking to a friend a few days ago who is trying to organize a comunity living space. He was talking about all the detail he had to manage. Some general thoughts on leading a business/organization:

  • Establish a clear mission/strategy. A leader’s fundamental job is to get everyone to understand it. Once they do, they can execute against it without undue communication burden on the leader. In Built To Last the authors argue that religious adherence to a mission/strategy is the key to the success of great companies. Note, a mission is much more specific than e.g. earn money. It is something the heart can grasp. It is something that excludes as well as included.
  • Let people do the things they are best at even if their best is worse than yours. You need to do the things you are best at. This is basic David Ricardo and it is as true today as it was in the Eighteenth Century
  • Seek excellence. In many areas of cognitive skill (e.g. programming, writing, selling, managing, etc) the best is more than 10x better than the worst. If you can find these 10x better people and get them working for you in the areas they specialize, that is a gigantic win. Organize the company to leverage their excellence and insure against their inadequacies. Note: You too may be both excellent and terribly flawed. Insure against your own flaws too.
  • Trial and Error. In Art and Fear, the authors tell a story of divding a ceramics class in two. One half is told that they will be judged on quantity. At the end of the term, all their projects will be weighed; 50 lbs gets an A, 40 gets a B, etc. The other half will be judged on quality. They only have to produce one piece. It just has to be great. At the end of the term, all the best art came from the quantity half. Think about why. There are lots of reasons. Structure your business so you can iterate quickly and learn fast. Lower the cost of trial, insure against error.
  • Plan and execute. Eisenhower one said ” In preparing for battle I have always found that plans are useless, but planning is indispensable.” Planning helps you react effectively when things go wrong, because part of the planning process is thinking about risk and contingency. By the way, things always go wrong. Don’t freak. Just handle them.

By the way, it is much easier to say these things than to live with them. I am writing this post as much to reinforce these thoughts in myself as I am to share them with my readers.


Premature Optimization vs “Prototyping”

April 14, 2004

“Premature Optimization is the root of all evil.”
–Donald Knuth

In response to <a href=”http://www.alexjacobson.com/archives/2004_04.html#000046″ my post of a relational database management system in HaskellI got a bunch of backchannel feedback from folks about whether Haskell is just a prototype language and about the utility of this sort of code. I’m going to try to answer it here.

Concern: Functional languages may be easier to use but they consume more CPU and memory which can cause problems when you need to scale.

Response I don’t know that Haskell performs substantiallyworse than e.g. Java or Python on various tasks, but, either way, my bet is that CPU and memory have gotten sufficiently cheap so that it takes a while before
scale hits you and until that time, the thing to optimize is development time. Also, scale is more of an architectural and systems issue than a programming language issue. 1-10x differences in memory/cpu just aren’t enough to justify
language decisions given that your choice of algorithm (O(n) vs O(n^2) vs O(log(n)) matter so much more when you get scale and Haskell make it easier to program efficient algorithms. However, although on the memory
consumption side, Haskell does not yet have a 64-bit implementation, its compiler developers say it would not be hard to get there. (Note: There is no 64 bit java either). On the CPU side, there is already GPH, a version ofHaskell that parellizes code accross multiple
CPUs. Haskell’s referential transparency means that it is much more straightforward to parellize execution than it is in traditional imperative programming lanugages.

Concern: Why not use an off-the-shelf RDBMS?

Response:Off the shelf RDBMSs are optimizations for a prior era when memory was VERY expensive and people believed in long transaction locks. Today Sun’s web site offers servers with .5 terabyte of RAM (nearly 2k for every man, woman, and child in the US!) and the management costs of big disk arrays combined with the development and CPU cost of optimizing disk read/writes mean that memory is actually cheaper than disk in many contexts. (Note: Google doesn’t use disk for exactly this reason). The web made long duration transactions obsolete. All the information for a database transaction is available in a single HTTP request. If the server imposes a total order on all HTTP requests (see SEDA), then write ahead logging of HTTP requests combined with periodic serialization of server state to disk obviates the need for much of the transaction infrastructure in databases (see Prevayler). The bonus here is much faster development time and a substantial performance improvement because you are not writing and executing code to martial data/queries to/from some external relational database. (Prevayler in Java claims a 1000x improvement over MySQL and a 3000x improvement over Oracle)

Concern:Why not use a more standard language (databases are a small part of application code and there are tons of examples of how to do stuff in traditional lanuages with OTS DBMSs)?

Response: If I can write a RDBMS in Haskell in a couple of part-time weeks in under 1000 lines of code then you can probably write your application equivalently fast and integrating that app with a Haskell DBMS will be easier than calling an external one (and save you money on system administration). Also I recommend this Paul Graham here:

We knew that everyone else was writing their software in C++ or Perl. But we also knew that that didn’t mean anything. If you chose technology that way, you’d be running Windows. When you choose technology, you have to ignore what otherpeople are doing, and consider only what will work the best.

This is especially true in a startup. In a big company, you can do what all the other big companies are doing. But a startup can’t do what all the other startups do. I don’t think a lot ofpeople realize this, even in startups.

The average big company grows at about ten percent a year. So if you’re running a big company and you do everything the way the average big company does it, you can expect to do as well as the average
big company– that is, to grow about ten percent a year.

The same thing will happen if you’re running a startup, of course. If you do everything the way the average startup does it, you should expect average performance. The problem here is, average performance means that you’ll go out of business. The survival rate for startups is way less than fifty percent. So if you’re running a startup, you had better be doing something odd. If not, you’re in trouble.

Beating the Averages

Concern: OTS RDBMS’s do indexing.

Response: My HARDBMS automatically indexes all fields (and lets you define custom datatypes and indexes for specialized data). Moreover being native means that recursive queries (such as all the friends of friends of friends etc.) can be done with reasonable performance (traditional database require a round trip to the DBMS for every level of recursion and that quickly becomes too hard — note the performance problems of Friendster running MySQL!).


A Relational DBMS in under 1000 lines of (Haskell) code!

April 14, 2004

In thinking about a data storage model for a web app I wanted to develop, and in finding Haskell so concise and expressive, I wondered if one could write a relational DBMS in Haskell in under 1000 lines of code. The answer appears to be yes!

Features:

  • non-destructive-update Haskell DBMS (can use a
    relational database without escaping to the IO
    monad!)

  • supports user defined types
  • supports user defined relations and functions
  • command pattern structure for write-ahead logging
  • Inner,Outer,Left,Right joins on arbitrary
    (user-defined) relations (not just “=”)

  • in-memory/in-process means no disk/marshalling overhead

Risks:

  • functions/aggregates not yet implemented e.g.
    (a<b+c) or (a<max(b))

  • no performance testing — joins expensive!
  • no proof of correctness
  • written by non-academic new haskell developer
  • Not SQL. No Sockets. — should be part of the
    app wrapper used to maintain consistency!

License: GPL

Note: I am an expert neither in Haskell, nor in
relational databases, nor in
relational algebra/set theory/category theory. So
comments/suggestions/recommendations on any aspect
of this code are welcome.

Comment on Haskell: WOW!!!!!! I basically wrote
this without testing just thinking about my
program in terms of transformations between types.
I wrote the test/example code at the end and had
almost no implementation errors in the code! The

Download Source


Choosing a web development framework and database server

September 26, 2003

The current state of the art from a server capacity perspective is cooperative multitasking and queues. The basic notion is that you should not have to pay for all the context switches in the operating system and the CPU when you don’t want/need them. If you design your application to yield everytime it is waiting for something to return then you gain MASSIVE performance improvement both in terms of speed and in terms of not requring huge amounts of memory per simultaneous client. See e.g. The C10K Problem.

In python, your options are async core (aka Medusa) or Twisted.. Medusa is being fecklessly maintained while twisted is in active development so the choice is obviously twisted. If I were working in Java, I might choose Seda.

So having chosen twisted you want to create a src directory. In the src directoy, create a directory with your application name (this is going to be a module name directory or package directory if you are in java). Add the src directory to your compiler or script language loadpath.

In that directory, create a main.py that takes a config and runs your site given your framwork. For Twisted, this looks something like this:


    #!/bin/env python
    from twisted.internet import app
    from twisted.web import static, server,script,vhost

    def start(config):
    root = static.File(config.fileBase)

    root.indices=['index.rpy']
    root.processors = {'.rpy': script.ResourceScript}
    root.ignoreExt(".rpy")

    rootroot = NameVirtualHost()
    rootroot.addHost(config.host,root)

    application = app.Application('web')
    application.listenTCP(config.port, server.Site(rootroot))
    application.run()

    import sys
    if __name__=="__main__":
    configPath=sys.argv[1]
    execfile(configPath)
    main(__name__)

This script allows you to start the site either with config.py as a commandline argument or from another script by passing a config as an argument to start. It also runs the site as a virtual host so that unless someone knows the name of the file your dev site is inaccessible to nosy people on your LAN even if you are exposing the port to the world. Remember to edit your hosts file to get the local name to resolve correctly.