Thrudb on EC2: A step-by-step guide
Open source document storage suite Thrudb provides a compelling system for building cheap, scalable document data storage. It lends itself nicely to running on Amazon EC2, backed by Amazon S3, with indexing provided by CLucene, the C++ port of Lucene.
Four services make up the Thrudb suite: Thrudoc for document storage, Thrucene for indexing, Thruqueue for persistent message queuing, and Throxy for load balancing. Best of all, these services are exposed via Thrift, an open source cross-language communication library, which means Thrudb can be used natively from C++, Java, Python, PHP, and Ruby.
Probably the biggest hurdle to using Thrudb is getting all of the dependencies in order and building it. To help with that, the folks at AideRSS have published Amazon EC2 AMIs built on CentOS with Thrudb already installed and ready to use. If you’re looking to get going quickly, those AMIs are a great place to start.
But the AideRSS folks don’t explain how they built the AMIs, so while they’re fantastic for getting started, using them won’t bring you much closer to understanding what it takes to get Thrudb up and running.
Below are step-by-step instructions to go from zero to running Thrudb for less than a dollar.
Setup
Note: commands always start with a prompt: $ on the local machine, and # on the EC2 instance. Any command that doesn’t appear to start with a prompt is probably line wrapping and should be entered as part of the preceding line.
We’ll start with an AMI for Ubuntu 7.10 on a small instance type. Theoretically, the process should be about the same for a medium or large instance, just with a different initial AMI.
Before we begin, you need to have installed and properly configured the Amazon EC2 command line tools. Open a terminal and navigate to your EC2 tools folder. Now we’re ready to start!
Creating an EC2 Instance
First, we need to create a new EC2 instance (server). For that, we need a keypair to use for logging in to our instance, which we create with:
$ ec2-add-keypair thrudbkp
Which will output something like:
KEYPAIR thrudbkp 97:2a:ab:87:fe:9b:0c:5b:46:b3:f9:.......
-----BEGIN RSA PRIVATE KEY-----
HenECthNUE9E9UHehNEehOU9EOhxOEuK/cccw/NYEDH9/3AIZZ2M/a2mccecUNEehEqxi3Zv
<……… a bunch of stuff ……..>
fL7XiYde/Svt3Ms7+XNetnOOnuhoeNEj/dA6wlcau98cf5NQnyJtFF9QzPvN/yclao9uWV
-----END RSA PRIVATE KEY-----
Copy all but the first line (everything between and including the BEGIN/END RSA PRIVATE KEY lines), and paste it in to a file called thrudbkp in the current directory.
Note: if you’re using a *NIX-type OS, set the permissions of your new thrudbkp file to 600:
$ chmod 600 thrudbkp
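The copy-and-paste step above can also be scripted. Here is a minimal sketch, assuming the output looks like the example shown earlier; extract_private_key is a hypothetical helper of my own, not part of the EC2 tools:

```shell
#!/bin/sh
# Hypothetical helper (not part of the EC2 tools): read ec2-add-keypair
# output on stdin and print only the private key block, markers included.
extract_private_key() {
    sed -n '/BEGIN RSA PRIVATE KEY/,/END RSA PRIVATE KEY/p'
}

# Example use on the local machine (commented out so the sketch has no
# side effects when sourced):
#   umask 077    # new files are created readable by the owner only
#   ec2-add-keypair thrudbkp | extract_private_key > thrudbkp
```

With umask 077 in effect, the file is created with 600 permissions from the start, so the key is never briefly readable by other users.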
Next we create our Ubuntu 7.10 instance from ami-e2df3a8b, specifying our private key file, thrudbkp, that we’ll need when we login:
$ ec2-run-instances ami-e2df3a8b -k thrudbkp
It will take a minute or so for our instance to come up. We can check the status with:
$ ec2-describe-instances
Once the status changes from pending to something like ec2-XX-XX-XX-X.X-X.compute-X.amazonaws.com, our instance is ready to use! By default, instances are firewalled, so we have to open up port 22 to allow us to SSH into it:
$ ec2-authorize default -p 22
Now we can SSH to our new server:
$ ssh -i thrudbkp root@ec2-XX-XX-XX-X.X-X.compute-X.amazonaws.com (replace with your actual hostname from ec2-describe-instances)
Note: If you’re using a graphical SSH client, you just need to specify your thrudbkp file as a “private key” or “identity” file.
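If you’d rather not copy the hostname by hand, it can be pulled out of the ec2-describe-instances output. This is only a sketch: first_ec2_hostname is a hypothetical helper, and it assumes public hostnames follow the ec2-N-N-N-N.compute-N.amazonaws.com pattern shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the first public EC2 hostname found on stdin.
# Assumes hostnames look like ec2-N-N-N-N.compute-N.amazonaws.com; prints
# nothing if no such hostname appears (e.g. the instance is still pending).
first_ec2_hostname() {
    grep -o 'ec2-[0-9-]*\.compute-[0-9]*\.amazonaws\.com' | head -n 1
}
```

Something like `ssh -i thrudbkp root@$(ec2-describe-instances | first_ec2_hostname)` should then log in to the first instance listed. If you have more than one instance running, pick the hostname by hand instead.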
Initial Setup on EC2
Alright, we’re ready to start installing. Most of the dependencies are available with apt-get, but some we’ll have to build from source. The first thing we need for thrudb is thrift, which itself has quite a few dependencies.
First let’s make a directory to put our files in:
# mkdir buildthrudb
# cd buildthrudb
Next we’ll update apt-get and install our build tools and thrift dependencies:
# apt-get update
# apt-get -y install subversion g++ make flex bison python-dev libboost-dev libevent-dev automake pkg-config libtool
Also, since we’ll need it later, let’s update CPAN, the Perl module install tool now:
# perl -MCPAN -e "install Bundle::CPAN"
CPAN takes 10 minutes or so to update. When prompted about “manual configuration”, enter no. Eventually you’ll get 3 more prompts, to which the defaults are fine (Update configuration for libnet: no, Perl expression: exit, YAML.pm warning: y). Just hit return at each prompt.
Thrift
Now let’s grab thrift from SVN:
# svn co http://svn.facebook.com/svnroot/thrift/trunk/ thrift
# cd thrift
Then build and install it (don’t worry about the bootstrap.sh warnings):
# ./bootstrap.sh
# ./configure
# make
# make install
Again, this will take a few minutes.
Thrift Client Libraries
Now that thrift is installed, we need to install the client libraries for whichever languages we’re planning on using. The C++ and Python libraries are installed by default, but this guide will focus on Java and Perl as examples. If you get other client libraries working, leave a comment with the steps taken and I’ll amend this post.
For Java, first we need to update Java and ant:
# apt-get -y install sun-java5-jdk ant
This will take another 5 minutes or so. When prompted with the Java license, hit <tab><enter> twice. Now we’ll build and install with:
# cd lib/java
# ant install
# cd ../..
This will install the thrift JAR file to /usr/local/lib/libthrift.jar. For the Perl client libraries, we have another dependency to install:
# perl -MCPAN -e "install Bit::Vector"
Enter yes at the first prompt, then accept the defaults for the dozen or so prompts that follow. Now we can build and install:
# cd lib/perl
# perl Makefile.PL
# make
# make install
# cd ../..
Alright, done with thrift.
Thrudb Dependencies
Let’s go back to our build dir:
# cd ..
And start on the other dependencies for thrudb:
# apt-get -y install memcached libexpat1-dev libclucene-dev libspread1-dev libssl-dev libcurl4-openssl-dev liblog4cxx9-dev uuid-dev
We need Brackup, which has a couple of dependencies. CPAN seems to have trouble installing them all in one go, but one at a time seems to work:
# perl -MCPAN -e "install DBI"
# perl -MCPAN -e "install DBD::SQLite"
# perl -MCPAN -e "install Brackup"
These will take a few minutes each. Accept all defaults when prompted.
There are 2 dependencies that we’ll install from source: libmemcached and Spread (since, as of this writing, apt-get has Spread 3.x and we need Spread 4.x).
libmemcached
Now we’ll get, build, and install libmemcached:
# curl http://download.tangent.org/libmemcached-0.12.tar.gz | tar xzf -
# cd libmemcached-0.12
# ./configure
# make
# make install
# cd ..
Spread
Spread requires filling in a form to download it, which will be tricky from the command line, but curl to the rescue (replace the values of name, company, and email in the first command with your own information):
# curl -L -d FILE=spread-src-4.0.0.tar.gz -d name="EC2 User" -d company="Amazon EC2 User" -d email="unknown@amazon-ec2-user.com" -d Stage=Download http://www.spread.org/download/spread-src-4.0.0.tar.gz | tar xzf -
# cd spread-src-4.0.0
# ./configure
# make
# make install
# cd ..
Update Shared Libraries
For linking to work right later, we need to update our shared libraries with:
# /sbin/ldconfig
Install Thrudb
Well, if you’ve made it this far, congratulate yourself. We’re now ready to actually install thrudb! First we’ll get it from SVN:
# svn checkout http://thrudb.googlecode.com/svn/trunk/ thrudb
# cd thrudb
And prepare to configure:
# ./autogen.sh
If we just run ./configure at this point, we’ll get the following error:
configure: error: CLucene.h missing; please add clucene development files
The problem isn’t actually with CLucene.h; it’s with clucene-config.h, which is expected to be in /usr/include/CLucene/, but for some reason is actually in /usr/lib/CLucene/. To fix it, we could move (or copy) the file, but in the interest of avoiding such manual hackery, we’ll just add /usr/lib to the include search path by setting CPPFLAGS=-I/usr/lib before ./configure:
# CPPFLAGS=-I/usr/lib ./configure
# make
# make install
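If you’re on a different release and aren’t sure where clucene-config.h ended up, a small check can pick the right -I directory for you. This is a sketch under the assumption that the header sits in a CLucene/ subdirectory of one of the candidates you pass in; find_clucene_include_dir is a hypothetical helper, not something thrudb ships.

```shell
#!/bin/sh
# Hypothetical helper: print the first directory, from the candidates given,
# that contains CLucene/clucene-config.h. That directory is what -I needs.
# Returns non-zero if none of the candidates has the header.
find_clucene_include_dir() {
    for dir in "$@"; do
        if [ -f "$dir/CLucene/clucene-config.h" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}
```

On the instance built in this guide, `find_clucene_include_dir /usr/include /usr/lib` should print /usr/lib, so `CPPFLAGS=-I$(find_clucene_include_dir /usr/include /usr/lib) ./configure` amounts to the same command as above.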
Thrudb Client Libraries and Tutorials
And finally, we just need to build the thrudb client libraries (similar to what we did for thrift), and tutorials if we want to test it out.
To build the client libraries, you just need to run /usr/local/bin/thrift on Thrudoc.thrift and Thrucene.thrift and specify which language to generate. However, there’s a handy Makefile in the tutorial directory that will take care of this for us:
# cd tutorials
# make
Now we’re ready to test it out. First, start memcached (even if we aren’t planning on using it, it must be running or the thrudoc server will crash):
# /etc/init.d/memcached start
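Since thrudoc crashes without memcached, a cautious script might confirm that memcached is really up before going any further. Here is a rough sketch using a pidfile; pid_alive is a hypothetical helper, and the pidfile location is an assumption you should check against your memcached package.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given pidfile exists and names a
# process that is currently alive. kill -0 sends no signal; it only tests
# whether the process can be signalled.
pid_alive() {
    [ -f "$1" ] && kill -0 "$(cat "$1")" 2>/dev/null
}
```

For example, `pid_alive /var/run/memcached.pid || /etc/init.d/memcached start` would start memcached only when it isn’t already running, assuming that is where the pidfile is written.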
Then start thrudoc and thrucene with the handy control script:
# ./thrudbctl start
At last! Let’s build and run the Java tutorial:
# cd java
# ant
# ./run_tutorial.sh
# cd ..
You should see something like:
java -cp tutorial.jar:/usr/local/lib/libthrift.jar BookmarkExample $*
*Indexed file in: 398ms*
Searching for: 'tags:(+css +examples)' random:true
Found 3 bookmarks
1 title : Uni-Form
url : (http://dnevnikeklektika.com/uni-form/)
tags : (examples CSS)
2 title : Dynamic Drive CSS Library- Practical CSS codes and examples
url : (http://www.dynamicdrive.com/style/)
tags : (css examples)
3 title : Dynamic Drive DHTML Scripts -DD Tab Menu (5 styles)
url : (http://www.dynamicdrive.com/dynamicindex1/ddtabmenu.htm)
tags : (cool css examples menu)
Took: 73ms
Searching for: 'title:(linux)' sortby:title
Found 4 bookmarks
1 title : Debian GNU/Linux System Administration Resources
url : (http://www.debian-administration.org/)
tags : (linux administration tips)
2 title : Linux Scalability
url : (http://www.cs.wisc.edu/condor/condorg/linux_scalability.html)
tags : (linux sysadmin ulimit)
3 title : Set Up Postfix For Relaying Emails Through Another Mailserver | HowtoForge - Linux Howtos and Tutorials
url : (http://www.howtoforge.com/postfix_relaying_through_another_mailserver)
tags : (email linux server)
4 title : ZFS on FUSE/Linux
url : (http://zfs-on-fuse.blogspot.com/)
tags : (zfs linux fuse)
Took: 5ms
*Index cleared in: 36ms*
The Perl tutorial needs one more dependency:
# perl -MCPAN -e "install Class::Accessor"
Again, accept the default when prompted. Then run with:
# cd perl
# perl BookmarkExample.pl
You should see output similar to the Java example.
Next Steps
Congratulations! You now have thrudoc and thrucene running on your own EC2 instance. From here, you can poke around in the tutorial directory and have a look at the *.conf files. When you’re done, disconnect from your EC2 instance with:
# exit
To make use of thrudb from a different server, you need the thrift and thrudb client libraries for your language. For example, for Java:
$
Ilya Grigorik said,
Wrote on January 8, 2008 @ 7:32 am
Awesome guide, thanks for the detailed writeup!
Jake Luciani said,
Wrote on January 8, 2008 @ 4:11 pm
Thanks for the great guide!
I’ve added it to the thrudb wiki
http://code.google.com/p/thrudb/wiki/InstallingThrudb
Thai Duong said,
Wrote on March 19, 2008 @ 12:32 pm
Some parts of this guide is outdated. Please go to http://thrudb.org/wiki/UbuntuInstallationGuide for the up2date information.