Re-searchr

Wednesday, June 3, 2009

Cloud computing throwdown: EC2 vs GoGrid vs Mosso

Re-searchr is a small site with modest needs. It is also 100% bootstrapped, so I keep a sharp eye out for the best deal on hosting. After first going with shared hosting (unsatisfactory for a lot of reasons, but that is a different post), I switched to cloud hosting and have enjoyed the benefits of having root access on all my servers.

Lately I have been doing my quarterly review of my hosting options, and this time I thought I would share it with the world, since I am betting there are a LOT of websites similar to Re-searchr that are starting slow but want to be able to scale in the cloud when needed.

So here is the architecture I spec out for Re-searchr: 2 Application (Web) Servers and 1 DB Server. Some sort of load balancing is necessary between the 2 application servers, but it can be handled by Squid or Apache; it doesn’t have to be specialized hardware.

I’ll start my calculations w/ the following assumptions: 720 hours of time per month and, though I don’t monitor it closely, 20GB of upload bandwidth and 20GB of download bandwidth.

GoGrid

For GoGrid, one can choose to go 100% on demand, or choose one of the pre-paid plans. For Re-searchr, which is meant to be up all the time, a pre-paid plan was obviously a good choice. For $100/month the Business plan seems to align best with my usage. This plan gives me 800 RAM-GB-hrs per month, w/ additional hours being charged at $0.10/RAM-GB-hr. For 3 512MB RAM servers I would use a total of: 1.5GB * 720 hrs = 1080 GB-hrs. That is over my max by 280 GB-hrs, which means an overage charge of $28.00. So the total GoGrid cost for 3 512MB servers is $128/month. Inbound bandwidth for GoGrid is free of charge, a very nice feature, but outbound bandwidth is charged at $0.50/GB, which adds $0.50/GB * 20GB = $10 to the total. So for one month the total GoGrid charge is right around $138.
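A quick way to sanity-check the pre-paid arithmetic is a few lines of Python. This is only a sketch: the function and parameter names are made up for illustration, and the inputs are the plan figures quoted above (monthly fee, included RAM-GB-hours, overage rate, outbound bandwidth rate), not anything taken from GoGrid's own tooling.

```python
def prepaid_plan_cost(plan_fee, included_gb_hrs, overage_rate,
                      servers, ram_gb_each, hours,
                      outbound_gb, outbound_rate):
    """Monthly cost of a pre-paid RAM-hour plan, plus overage and outbound bandwidth."""
    used_gb_hrs = servers * ram_gb_each * hours
    # Only the RAM-GB-hours beyond the plan allowance are billed as overage.
    overage_gb_hrs = max(0.0, used_gb_hrs - included_gb_hrs)
    return plan_fee + overage_gb_hrs * overage_rate + outbound_gb * outbound_rate


# 3 servers of 512MB each, 720 hours, 20GB outbound (inbound is free).
gogrid_total = prepaid_plan_cost(
    plan_fee=100.0, included_gb_hrs=800, overage_rate=0.10,
    servers=3, ram_gb_each=0.5, hours=720,
    outbound_gb=20, outbound_rate=0.50,
)
print(f"GoGrid: ${gogrid_total:.2f}/month")
```

Changing `servers` or `ram_gb_each` shows how quickly the overage grows once the plan allowance is used up.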

GoGrid has a free F5 load balancer (though it doesn’t do much more than just the most simple balancing algorithms).

Mosso

Mosso offers servers in increments of RAM going from 256MB to 15GB (wow!) for different price-points per hour on the way. The 256MB servers are a very cheap $0.015/hr, 512MB servers are $0.03/hr, and 1GB servers are $0.06/hr.

To match GoGrid, let’s compute the 512MB servers. For a month we have 720 hrs * $0.03/hr * 3 servers = $64.80, or about $65. Bandwidth costs at Mosso are $0.10/GB, both upload and download, coming to a total of $0.10 * 2 * 20 = $4.

The total cost for Mosso comes out to $69/month.

Mosso does not offer a hardware load balancer, so load balancing would be done on one of the web nodes (though a 256MB server to run it would only cost ~$11).

Since that is significantly lower than GoGrid, let’s compute the next server size up as well (1GB RAM). The monthly server charge is 720 hrs * $0.06/hr * 3 servers = $129.60. Bandwidth costs remain the same, so the total would be $133.60.
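The pure pay-per-hour model is even simpler to put into code. The sketch below just replays the hourly rates quoted above for each server size; the dictionary and function names are hypothetical, chosen only for this example.

```python
# RAM size in GB -> hourly rate in dollars, as quoted in this post.
MOSSO_HOURLY = {0.25: 0.015, 0.5: 0.03, 1.0: 0.06}


def hourly_plan_cost(rate_per_hr, servers, hours, gb_in, gb_out, in_rate, out_rate):
    """Monthly cost when servers and bandwidth are both billed purely on usage."""
    return rate_per_hr * servers * hours + gb_in * in_rate + gb_out * out_rate


mosso_totals = {
    ram_gb: hourly_plan_cost(rate, servers=3, hours=720,
                             gb_in=20, gb_out=20, in_rate=0.10, out_rate=0.10)
    for ram_gb, rate in MOSSO_HOURLY.items()
}

for ram_gb, total in mosso_totals.items():
    print(f"3 x {ram_gb}GB servers: ${total:.2f}/month")
```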

Amazon EC2

EC2 offers servers in several sizes: small, medium and large. For Re-searchr the small servers would probably be sufficient. Small servers have 1GB of RAM and are charged at a rate of $0.10/hr. The relevant server cost for 3 servers is therefore $0.10/hr * 3 servers * 720 hrs = $216.00. Amazon offers the option of “reserving” a server for $325/yr and thereby reducing the computing costs to $0.03/hr. Since the fee applies to each reserved server, reserving 3 small instances comes to 3 * $325 / 12 = $81.25 per month, so the total cost for reserving them and running them would be: $0.03/hr * 3 servers * 720 hrs + $81.25 reserve fee = $146.05/month.

Bandwidth costs for Amazon EC2 are $0.10/GB upload and $0.17/GB download, so total costs would be: $0.10/GB * 20 + $0.17/GB * 20 = $5.40.

Amazon EC2 has just recently started offering load balancing (for a nominal fee). Estimating the load balancing cost requires knowing how many requests per month, etc. This will not be estimated now.

Worst case costs for Amazon are $221.40/month, while w/ reserved instances the costs could be as low as $151.45/month.
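Here is the on-demand vs. reserved comparison as a small Python sketch. It assumes the $325/yr reservation fee is paid once per reserved instance and spreads it evenly over 12 months; that assumption, and all of the names below, are mine for the sake of illustration. Load balancing is left out, as above.

```python
def ec2_monthly(hourly_rate, servers, hours, upfront_per_year=0.0):
    """Monthly compute cost; any yearly upfront fee is assumed to apply per instance."""
    usage = hourly_rate * servers * hours
    amortized_fee = servers * upfront_per_year / 12
    return usage + amortized_fee


# 20GB in at $0.10/GB and 20GB out at $0.17/GB, per the rates quoted above.
ec2_bandwidth = 0.10 * 20 + 0.17 * 20

ec2_on_demand = ec2_monthly(0.10, servers=3, hours=720) + ec2_bandwidth
ec2_reserved = ec2_monthly(0.03, servers=3, hours=720, upfront_per_year=325.0) + ec2_bandwidth

print(f"EC2 on demand: ${ec2_on_demand:.2f}/month")
print(f"EC2 reserved:  ${ec2_reserved:.2f}/month")
```

Setting `upfront_per_year` back to 0 shows how much of the reserved price is really the amortized fee.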

It is of course important to note not just the RAM values for the servers given above, but also the CPU capacity of the servers. GoGrid claims to supply servers with “a ratio of 1 Xeon core (equivalent to a P4 2.0 chip) to 4 GB of RAM” (so 1/8 of a P4 2.0 for a 512MB server), while Mosso supplies “Fair-weighted Quad Core processors (Guaranteed minimum. Weighting scales up with memory)” and EC2 supplies “the equivalent of a 1.0-1.2 GHZ 2007 Opteron or 2007 Xeon processor”.

All 3 services offer both 32 bit and 64 bit OS’es in a variety of flavors.

What to choose?

My opinion is that it depends… yes, that is not much of an opinion. Really one needs to evaluate their application based on whether it will operate better with a large CPU or more RAM. In a lot of cases RAM is more important than CPU because RAM is basically equivalent to caching in memory (either thru memcached or something else). Lots of sites are very DB heavy, meaning that the Web Servers do very little work besides asking the DB for data and then maybe looping through it (not extremely CPU intensive). This is almost the default for the Web Server because most folks want their urls to return in a timely manner, so not too much processing can take place. Then again, the heftier a CPU you have the more you can do… but truly most sites probably need a hefty CPU for their DB server, b/c that is where a lot of joining, sorting, etc. is really done.

So what does all this mean for Re-searchr? Well, it means I probably need to try out Mosso, since I am currently on GoGrid. Mosso would be cheaper for 1GB servers, and I could even bump up my DB server to a 1.5GB or 2GB server, and get the best of both worlds. However, Mosso does not publish what CPU levels the various RAM levels get (I asked them on Twitter, and they said they do not give out that information), so I’d have to load test in order to look at overall performance. Somewhat surprisingly to me, EC2 doesn’t really compete; they are massively more expensive except when agreeing to a year’s term, which is unacceptable for a bootstrapped startup.

Wednesday, April 15, 2009

#hashtags and @replies....

Recently I was intrigued by some Twitter posts by @garrickvanburen relating to whether #hashtags and @reply syntax on Twitter were useful anymore with the advent of Twitter search. The topic of #hashtags has been discussed by Robert Scoble on Friendfeed as well, where Scoble claimed that "hashtags are dead".

I disagree with the sentiment that #hashtags serve no purpose anymore, and I think that part of the issue with people thinking they are dead is a lack of understanding about how search algorithms work (or should work). The key to #hashtags being useful is that they represent a significant act by the user who added them to their Tweet. The act of adding a #hashtag is akin to saying "this post is about #thisTopic", and that is powerful fuel for a good Information Retrieval algorithm that takes advantage of it. Now I realize that not all #hashtags really describe what the post is about, but the majority do and this data should be used.

During my Twitter exchange about this topic I made the following point:

A Tweet with a url and a #hashtag is directly translatable to a Delicious entry.

What I meant by this is that Delicious lets you tag your bookmarks into various categories and then search/sort along those dimensions. So if I write a Tweet that essentially shares a url (a very common use case), and then put a #hashtag on it, I have done the exact same thing. If this is powerful for Delicious it is certainly powerful for Twitter. In fact I would posit it is more powerful for Twitter, because Tweets are only 140 characters which is not a lot of data from which a search algorithm can determine similarity and relevance.

So how does a search algorithm take advantage of this data? Well, a very simple way is to "boost" the relevance score for that field. Google does this based on HTML tags (well I think they do or at least did at some point), giving more weight to the text inside a <title> tag than, say, inside an <h4> tag. There are more complex ways of utilizing this data as well... but that is another post.
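To make the "boost" idea concrete, here is a toy scoring function in Python. It is only a sketch of field boosting in general, not how Twitter search or Google actually rank anything: the function name, the boost value of 2.0, and the sample tweets are all invented for illustration.

```python
def hashtag_boosted_score(query, tweet, hashtag_boost=2.0):
    """Count query-term matches in a tweet, weighting #hashtag matches more heavily."""
    terms = {term.lower() for term in query.split()}
    total = 0.0
    for word in tweet.split():
        is_hashtag = word.startswith("#")
        token = word.strip("#.,!?:;").lower()
        if token in terms:
            # A match on a hashtag counts for more than a match in the body text.
            total += hashtag_boost if is_hashtag else 1.0
    return total


plain_tweet = "reading about python web frameworks http://example.com"
tagged_tweet = "reading about web frameworks http://example.com #python"

print(hashtag_boosted_score("python", plain_tweet))
print(hashtag_boosted_score("python", tagged_tweet))
```

Both tweets mention the query term once, but the one where the author deliberately tagged it scores higher, which is the whole point of treating the #hashtag as its own field.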

Okay, so maybe I convinced you that #hashtags are not dead, but you are probably asking about @replies now. Well, the argument is the same, if a bit more convoluted. The @reply tags are also meta-data about the post; they indicate who the intended audience for the post is. Knowing the intended audience means that you can then utilize meta-data you know about that user's aggregate activity to "boost" the Tweet up or down. One example of this would be using TunkRank to re-weight a Tweet based on where the Tweet was going. This is essentially giving extra weight to a Tweet going to someone who is an expert, or a power-user.
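A rough Python sketch of that re-weighting might look like the following. The influence scores here are made-up numbers standing in for something like TunkRank (this does not compute TunkRank itself), and the user names and the "1 + best score" formula are assumptions chosen only to show the shape of the idea.

```python
import re

# Hypothetical influence scores in [0, 1]; a real system might plug in TunkRank here.
INFLUENCE = {"expert_user": 0.9, "casual_user": 0.1}


def audience_weight(tweet, influence):
    """Return 1 plus the highest influence score among the users @mentioned in the tweet."""
    mentions = re.findall(r"@(\w+)", tweet)
    best = max((influence.get(name.lower(), 0.0) for name in mentions), default=0.0)
    return 1.0 + best


def reweighted_score(base_score, tweet, influence=INFLUENCE):
    """Scale a tweet's base relevance score by who it was addressed to."""
    return base_score * audience_weight(tweet, influence)


print(reweighted_score(1.0, "@expert_user what do you think of this?"))
print(reweighted_score(1.0, "@casual_user what do you think of this?"))
print(reweighted_score(1.0, "nobody in particular"))
```

A Tweet with no @reply keeps its base score, so the audience signal can only help a result, never bury it; whether that is the right design is a separate question.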

@replies and #hashtags are important pieces of meta-data (some of the only pieces of meta-data we get in Tweets); they should and will be considered by advanced search/indexing algorithms in the ranking of Tweet search results. So my plea is don't let @replies and #hashtags die, we would lose a useful system for collaborative filtering and intelligence, and make data-mining twitter all the more difficult.

Monday, April 6, 2009

TweeFind: twitter search engines by relevance?

I read this post on Mashable about a new Twitter search engine and I had to comment:

Google's ranking system is a mix of an IR (Information Retrieval) ranking on the similarity of the search term to the content and page rank. The problem here as I see it is 140 characters means that IR score is not very illuminating, so we get back tons of results with say 1 word matching b/t the search term and the tweet. Once we realize this, then we see that the scoring algorithm for "popular" twitter users becomes the driving force behind pushing tweets up or down.

That is all fine, but the fact that the content is so thin means that flipping results up and down based on a user's aggregate behavior TELLS US NOTHING ABOUT THE GOODNESS/BADNESS of the tweet in question. So we get in the situation where a bad tweet by a good author gets promoted, and the results are not very good. My $0.02.
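The effect is easy to reproduce with a toy example. The Python sketch below uses a deliberately crude term-overlap score as a stand-in for a real IR score and multiplies it by an author popularity score; the tweets, the scores, and the function names are all invented for illustration and have nothing to do with how TweeFind is actually implemented.

```python
def ir_score(query, text):
    """Fraction of query terms found in the text (a crude stand-in for a real IR score)."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(term in words for term in terms) / len(terms)


def combined_score(query, text, author_score):
    """Content similarity multiplied by the author's aggregate popularity."""
    return ir_score(query, text) * author_score


# (tweet text, hypothetical author popularity)
sample_tweets = [
    ("python is a snake", 0.9),              # off topic, popular author
    ("python tutorial for beginners", 0.3),  # on topic, less popular author
]

ranked_tweets = sorted(
    sample_tweets,
    key=lambda item: combined_score("python", item[0], item[1]),
    reverse=True,
)
for text, author_score in ranked_tweets:
    print(text, ir_score("python", text), author_score)
```

With a one-word query both tweets get the same content score, so the author score alone decides the order and the off-topic tweet lands on top.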

This reasoning is why I think that the best way to evaluate twitter posts, in a search context, is to constrain to posts that share urls. That being said, the current search at http://search.twitter.com is great b/c it is not for finding data or "deep searching" a topic. I think it is perfect for queries that are realtime. This past weekend, when one of my favorite bands canceled a show in Minneapolis, I asked Twitter "gaslight anthem medical emergency" and I got back fantastic results. What I won't ask Twitter search (without advanced operators to limit to tweets with urls, and even then I am dubious if I would get decent results) is "turbogears simpledb integration", because I am looking for deep information. So I guess to me, searching twitter is for realtime, "breadth" searches, and searching Google gives me depth. Of course I could search re-searchr and get both :)

Tuesday, March 17, 2009

Real time re-searchr

Just completed a nice update to re-searchr, implementing real-time search inside the standard search from re-searchr's homepage. The realtime results are presented above the normal search results re-searchr displays in the default view. However, if you decide to sort by the re-searchr rank, the realtime results are ranked right along with all the search results, so you truly see an integrated view of both real-time and crawl-based search results.
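In code terms, the two views boil down to something like the sketch below. This is not re-searchr's actual implementation; the (url, rank) tuples, the rank scale, and the function name are assumptions made up to illustrate the difference between the default view and the rank-sorted view.

```python
def integrated_view(realtime_results, crawled_results, sort_by_rank=False):
    """Combine realtime and crawl-based results.

    Each result is a (url, rank) pair where a higher rank is better.
    """
    if sort_by_rank:
        # One merged list: realtime results compete directly with crawled ones.
        return sorted(realtime_results + crawled_results,
                      key=lambda result: result[1], reverse=True)
    # Default view: the realtime block is simply shown above the normal results.
    return realtime_results + crawled_results


realtime_results = [("http://a.example/realtime", 0.4)]
crawled_results = [("http://b.example/crawled", 0.9), ("http://c.example/crawled", 0.2)]

print(integrated_view(realtime_results, crawled_results))
print(integrated_view(realtime_results, crawled_results, sort_by_rank=True))
```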

The wonderful thing about re-searchr is that it is not influenced by the same metrics as Google or Yahoo et al.; it focuses on the content of the page as well as the intensity of the page's mentions on various social networks. This enables real-time results to get a "fair shake" as compared to standard search engines, where pages have to be around a while to accumulate links and get those links counted by the web crawlers.

This update is exciting but there will be more to come soon. What would you like to see re-searchr do first, oauth for twitter, facebook connect login, or updates to the inline search analysis presentations (stars, clocks next to search results)?

Let me know here or on twitter @researchr.

Wednesday, February 25, 2009

What's new...

In the last couple weeks quite a few things have been happening on re-searchr, some of which made it to re-searchr's twitter account, some of which didn't, so I figured I'd sum up what has been going on here.

First off, re-searchr got a little more social. Going off my theory that social search and searching socially are different things, I added some features for searching socially. Briefly, searching socially is a passive thing where your search results can be altered by algorithmic re-ranking based on your social network (and what they have said about various websites). For re-searchr, once you register and add in your social network credentials, re-searchr will use the number of comments and content of the comments from your network for a search result url in its re-ranking of your searches. The re-ranking occurs for searches at re-searchr as well as when using the re-searchr toolbar to follow your searches on Google, Yahoo, MSN and Ask.

Secondly, re-searchr got an addition to the rollover previews for search results. The Twitter and Friendfeed icons now have rollover states which show Twitter and Friendfeed comments that mention that particular search result url. If you are registered with re-searchr, and have left Twitter and Friendfeed credentials, then you will see comments from your network first at the top of the rollover window; the rest of Twitter will be available in date order below that. This information is useful in a couple of ways: first, it gives you an idea of what other people think of the page in question, and second, it allows you to connect with people who also have interests in the page, and most likely your search query.

Third, re-searchr got sentiment analysis. As you can see in the picture above, each comment is labeled with a smiley face (good sentiment), a blank face (no sentiment), or a frowny face (bad sentiment). The sentiments are also rolled up and shown in the meta-data next to the search result urls. The sentiments are used to affect the ranking of the search result. In the same vein as above, sentiment data from your social networks has much more effect on rankings than random sentiment.
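As a rough illustration of how sentiment from your own network can count for more than random sentiment, here is a small Python sketch. It is not the actual re-searchr ranking code; the label values, the weight of 3.0 for network comments, and the sample data are all assumptions picked for the example.

```python
# Numeric value for each sentiment label (hypothetical scale).
SENTIMENT_VALUE = {"good": 1, "none": 0, "bad": -1}


def sentiment_adjustment(comments, my_network, network_weight=3.0):
    """Sum comment sentiment for one url, weighting comments from my_network more heavily.

    comments is a list of (author, label) pairs.
    """
    total = 0.0
    for author, label in comments:
        weight = network_weight if author in my_network else 1.0
        total += weight * SENTIMENT_VALUE[label]
    return total


sample_comments = [("alice", "good"), ("bob", "bad"), ("carol", "bad"), ("dave", "none")]

print(sentiment_adjustment(sample_comments, my_network={"alice"}))
print(sentiment_adjustment(sample_comments, my_network=set()))
```

With alice in the network her single positive comment outweighs two negative comments from strangers; with an empty network the same url comes out net negative.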

Finally, re-searchr has started using GetSatisfaction to manage any problems, issues, or ideas that anyone has with the system. This service is much better than the old contact page/system, so I encourage everyone to leave feedback whenever something breaks, or with ideas on how to improve things on the site.

Re-searchr in the news

I am excited to say that a couple weeks ago, right after Minnedemo, I got contacted by Finance and Commerce about an article they were writing on leveraging social networks. I had great discussions with the author Arundhati Parmar, and the article is now hot off the presses (proverbially of course):

http://www.finance-commerce.com/article.cfm/2009/02/17/New-applications-leverage-the-popularity-of-social-networks

Saturday, February 7, 2009

Re-searchr at Minnedemo

Here is a video of me demoing re-searchr at Minnedemo, February 6th, 2009 in Minneapolis, Minnesota.