More Luigi: Presentation from OSCON

I was in Portland, OR for a few days hanging out at OSCON. Was fun. I also talked a bit about Luigi: Next week I'm presenting at the NYC Predictive Analytics meetup together with Blake Shaw from Foursquare.

From: Erik Bernhardsson

Optimizing over multinomial distributions

Sometimes you have to maximize some function $$ f(w_1, w_2, ldots, w_n) $$ where $$ w_1 + w_2 + ldots + w_n = 1 $$ and $$ 0 le w_i le 1 $$ . Usually, $$ f $$ is concave and differentiable, so there's one unique global maximum and you can solve it by applying gradient ascent.

From: Erik Bernhardsson

More Luigi!

Continuing in the same spirit of shameless self-promotion, here's some recent Luigi press: Reddit thread A Guide to Python Frameworks for Hadoop (slides from the NYC Hadoop User Group) This presentation from the Open Analytics NYC meetup about how Foursquare uses Luigi  Luigi is in the middle of a...

From: Erik Bernhardsson

hdfs2cass

Just open sourced hdfs2cass which is a Hadoop job (written in Java) to do efficient Cassandra bulkloading. The nice thing is that it queries Cassandra for its topology and uses that to partition the data so that each reducer can upload data directly to a Cassandra node.

From: Erik Bernhardsson

NoDoc

We had an unconference at Spotify last Thursday and I added a semi-trolling semi-serious topic about abolishing documentation. Or NoDoc, as I'm going to call this movement. This was meant to be mostly a thought experiment, but I don't see it as complete madness.

From: Erik Bernhardsson

Wikiphilia

I've been obsessed with Wikipedia for the past ten years. Occasionally I find some good articles worth sharing and that's why I created the wikiphilia Twitter handle. Just a long stream of stuff that for one reason or another may be interesting.

From: Erik Bernhardsson

Spotify's Discovery page

The Discovery page, the new start page in Spotify, is finally out to a fairly significant percentage of all users. Really happy since we have worked on it for the past six months. Here's a screen shot:

From: Erik Bernhardsson

Fermat's principle

I was browsing around on the Internet and the physics geek in me started reading about Fermat's principle. And suddenly something came back to me that I've been trying to suppress for many years – how I never understood why there's anything fundamental about the principal of least time.

From: Erik Bernhardsson

Snakebite

Just promoting Spotify stuff here: check out the Snakebite repo on Github, written by Wouter de Bie. It's a super fast tool to access HDFS over CLI/Python, by accessing the namenode directly over sockets/protobuf. Spotify's developer blog features a nice blog post outlining what it's useful for.

From: Erik Bernhardsson

Instrumenting Clojure for New Relic Monitoring

We've recently started evaluating the New Relic monitoring service at World Singles and when you use their Java agent with your web application container, you can get a lot of information about what's going on inside your application (JVM activity, database activity, external HTTP calls, web transac...

From: Sean Corfield: An Architect's View

Stuff that bothers me: “100x faster than Hadoop”

The simple way to get featured on big data blog these days seem to be Build something that does 1 thing super well but nothing else Benchmark it against Hadoop Publish stats showing that it's 100x faster than Hadoop $$$ Spark claims their 100x faster than Hadoop and there's a lot of stats showing ...

From: Erik Bernhardsson

Presentation about Luigi

I like the editing!

From: Erik Bernhardsson

Being data driven

I picked up an issue of Foreign Affairs while flying back to NYC from SFO. It features this long interview with U.S. General Stanley McChrystal and I thought it was pretty interesting how striking some of the similarities are between fighting in a war and developing software.

From: Erik Bernhardsson

Annoy

Annoy is a simple package to find approximate nearest neighbors (ANN) that I just put on Github. I'm not trying to compete with existing packages, but Annoy has a couple of features that makes it pretty useful.

From: Erik Bernhardsson

More Luigi!

Elias Freider just talked about Luigi at PyData 2013: The presentation above is much better than one I put together a few weeks ago. In case anyone is interested I'll include it too:

From: Erik Bernhardsson

ML at Twitter

I recently came across this paper describing how they do ML at Twitter. TL;DR Their approach is pretty interesting. Everything is a Pig workflow and then they do everything as UDF's. This approach seems pretty interesting.

From: Erik Bernhardsson

I'm featured in Mashable

This article from today in Mashable describes some of the fun stuff I get to work with: Erik Bernhardsson is technical lead at Spotify, where he helped to build a music recommendation system based on large-scale machine learning algorithms, mainly matrix factorization of big matrices using Hadoop.

From: Erik Bernhardsson

Slides from NYC Machine Learning talk

Slides from the talk. Slightly edited because (a) some of the slides make little sense taken out of context (b) Slideshare seem to have problem converting some of the stuff. Collaborative filtering at Spotify from Erik Bernhardsson

From: Erik Bernhardsson

NYC Machine Learning meetup

From the NYC Machine Learning talk I had last week: Haven't looked at it yet except briefly. Unfortunately the quality isn't the best.

From: Erik Bernhardsson

Momentum and mean reversion might just be volatility bias

The Economist just published an article called The best, the worst and the ugly.

From: Erik Bernhardsson