Monday, March 26, 2012

Dynamic DNS updates with nsupdate and BIND 9

I first saw nsupdate mentioned on the devops-toolchain mailing list as a tool for dynamically updating DNS zone files from the command line. Since this definitely beats manual editing of zone files, I thought I'd give it a try. My setup is BIND 9 on Ubuntu 10.04. I won't go into the details of setting up BIND 9 on Ubuntu -- see a good article about this here.

It took me a while to get nsupdate to work. There are lots of resources out there, but as usual it's hard to separate the wheat from the chaff. When all was said and done, the solution was relatively simple. Here it is.

Generate TSIG keys

dnssec-keygen -r /dev/urandom -a HMAC-MD5 -b 512 -n HOST myzone.com


This generates 2 files of the form:

-rw-------   1 root bind   122 2012-03-21 15:47 Kmyzone.com.+157+02058.key
-rw-------   1 root bind   229 2012-03-21 15:47 Kmyzone.com.+157+02058.private

Note that I specified /dev/urandom as the source of randomness, which may not meet your security requirements. When I didn't specify the -r /dev/urandom parameter, the dnssec-keygen command appeared to hang (it was blocking while waiting for enough entropy from the default /dev/random).

Also note that the type of the key needs to be HOST (specified via -n HOST).

Add key to DNS master server configuration and allow updates

I modified /etc/bind/named.conf.local and added a 'key' section:

key "myzone.com." {
 algorithm hmac-md5;
 secret "JKlA76FvmGboEQ8R2yoc9AtpFqkIncH5yf2mXY+s8m6a/RRC0thUVGnqrJSO1QKhzlnkbxTjmArap+WuVW9iLQ==";
};

The key name can be anything you want. The secret is the actual key, which can be found in both of the files generated by dnssec-keygen.
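Since you'll be pasting that secret into named.conf.local, it helps to pull it out of the generated files from the command line. Here is a minimal sketch, using the example file name from above (the .private file stores the secret on its 'Key:' line):

# Extract the base64 secret from the .private file generated by dnssec-keygen
grep '^Key:' Kmyzone.com.+157+02058.private | awk '{print $2}'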

I also added an allow-update directive to the zone that I wanted to modify via nsupdate. This is still in /etc/bind/named.conf.local:

zone "myzone.com" {
       type master;
       file "/var/lib/bind/myzone.com.db";
       allow-update { key "myzone.com."; };
};

I then restarted bind9 on the master DNS server via 'service bind9 restart'. I checked /var/log/daemon.log to make sure there were no errors during the restart.
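If you want to be a bit more careful, the restart can be wrapped in a quick sanity check. A minimal sketch, using the standard Ubuntu paths from this post:

# Syntax-check the BIND configuration (including named.conf.local) before restarting
named-checkconf /etc/bind/named.conf || exit 1
service bind9 restart
# Look for errors logged by named during the restart
tail -n 50 /var/log/daemon.log | grep named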

Note that you can exercise finer-grained control over which operations you allow for updates. See the 'Allowing Updates' section in this 'Secure DDNS Howto' document.

Use nsupdate to do remote updates

On a remote trusted host of your choice, copy the private file generated by dnssec-keygen, and create a file containing the desired updates to the zone file on the master. This file is of the form:

# cat nsupdate.txt
server master.dns.server.myzone.com
debug yes
zone myzone.com.
update add testnsupdate.myzone.com. 86400 CNAME ns1
show
send


Then run nsupdate and specify the key and the file you just created:

# nsupdate -k Kmyzone.com.+157+02058.private -v nsupdate.txt

If everything goes well, you should see something like this in the debug output of nsupdate (because we specified 'debug yes' in the nsupdate.txt file):

;; UPDATE SECTION:
testnsupdate.myzone.com. 86400 IN CNAME ns1

;; TSIG PSEUDOSECTION:
myzone.com. 0 ANY TSIG hmac-md5.sig-alg.reg.int. 1332788750 300 16 UxiMG7+X2RejWzQ9rkuPaQ== 3305 NOERROR 0 

Reply from update query:
;; ->>HEADER<<- opcode: UPDATE, status: NOERROR, id:   3305
;; flags: qr ra ; ZONE: 0, PREREQ: 0, UPDATE: 0, ADDITIONAL: 1
;; TSIG PSEUDOSECTION:
myzone.com. 0 ANY TSIG hmac-md5.sig-alg.reg.int. 1332788857 300 16 KBubhEggwBHnPlbmlQ7iTw== 3305 NOERROR 0 

On the master DNS server, you should see something like this in /var/log/daemon.log:

Mar 26 12:07:37 dns01 named[14952]: client 10.0.10.133#58265: signer "myzone.com" approved
Mar 26 12:07:37 dns01 named[14952]: client 10.0.10.133#58265: updating zone 'myzone.com/IN': adding an RR at 'testnsupdate.myzone.com' CNAME
Mar 26 12:07:37 dns01 named[14952]: zone myzone.com/IN: sending notifies (serial 2012032104)
Mar 26 12:07:37 dns01 named[14952]: client 10.0.10.121#50790: transfer of 'myzone.com/IN': IXFR started
Mar 26 12:07:37 dns01 named[14952]: client 10.0.10.121#50790: transfer of 'myzone.com/IN': IXFR ended
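To confirm that the new record is actually being served, you can query the master directly. A quick sketch with dig, using the placeholder names from this post:

# Ask the master DNS server for the record we just added
dig @master.dns.server.myzone.com testnsupdate.myzone.com. CNAME +short

# Expected output:
# ns1.myzone.com.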

One other important note: the modifications made with nsupdate take effect immediately on the DNS master server (and they also get pushed from there to slave servers), but they are not written immediately to the actual DNS zone file on disk on the master server. Instead, a journal file is used, in the same directory as the zone file. The journal entries get applied periodically to the main zone file. If you restart bind9, the journal entries also get applied.
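If you want the zone file on disk to reflect the updates right away (for example before hand-editing it), you can flush the journal without restarting bind9. A hedged sketch, assuming rndc is set up on the master (it is by default with the Ubuntu bind9 package):

# Apply pending journal entries to the zone file and suspend dynamic updates
rndc freeze myzone.com
# ... inspect or edit /var/lib/bind/myzone.com.db here ...
# Re-enable dynamic updates (named reloads the zone at this point)
rndc thaw myzone.com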

That's about it. If everything went well, you now have an API of sorts into your BIND 9 server. Now go automate all the things!
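As a first step toward that automation, here is a minimal sketch of a non-interactive update: the same CNAME addition as above, but fed to nsupdate on standard input instead of from a separate file (key file and host names match the earlier examples):

#!/bin/bash
# Add a CNAME record via nsupdate without a separate input file --
# handy for cron jobs or deploy scripts.
nsupdate -k Kmyzone.com.+157+02058.private <<EOF
server master.dns.server.myzone.com
zone myzone.com.
update add testnsupdate.myzone.com. 86400 CNAME ns1
send
EOF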


Thursday, March 08, 2012

PostgreSQL dump/restore and client_encoding

I started to look into EnterpriseDB recently. Pretty pleased with it so far. At first I launched the beta version of their Postgres Plus Cloud Database product, but since this version is in the process of being decommissioned, I've had to transfer the database I had already created to a newly created cluster in their DBaaS model -- which basically means that the cluster manager is maintained by them, and the cluster member servers (1 master + N replicas) are part of your EC2 footprint.

In any case, I did a pg_dump of the database from the initial master, then I tried to load the dump via psql into a newly created database on the new master. However, the client_encoding parameter in postgresql.conf was SQL_ASCII on the first master, and UTF8 on the second. This posed a problem. The psql load operation failed with errors of the type:

ERROR:  invalid byte sequence for encoding "UTF8": 0xe92044
CONTEXT:  COPY table1, line 6606
ERROR:  invalid byte sequence for encoding "UTF8": 0xa0
CONTEXT:  COPY table2, line 978
ERROR:  invalid byte sequence for encoding "UTF8": 0xd454
CONTEXT:  COPY table3, line 3295


Obviously the encodings didn't match. I first tried to re-run the pg_dump on the first master (which had client_encoding = 'SQL_ASCII') and specified "--encoding utf8" on the pg_dump command line. This didn't do the trick. I got the exact same errors when loading the dump on the second master.

One solution suggested by EnterpriseDB was to set client_encoding to SQL_ASCII on the new master, restart Postgres and retry the load. I found another solution, though, in a blog post very aptly titled (for my purposes) 'PostgreSQL database migration, the SQL_ASCII to UTF8 problem'. What I ended up doing, following the advice in the post, was to install the GNU recode utility (I did a yum install recode), then run the initial dump through recode, converting it from ISO-8859-1 to UTF-8. Something like this:

cat dump.sql | recode iso-8859-1..u8 > utf8_withrecode.sql

Then I modified the line

SET client_encoding = 'SQL_ASCII';

and turned it into:

SET client_encoding = 'UTF8';

after which loading the dump into the new master with psql worked like a charm.
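For reference, the whole conversion can be strung together from the shell. A hedged sketch, with the file names from above and placeholder host/database/user names for the new master:

# Re-encode the dump from ISO-8859-1 to UTF-8
recode iso-8859-1..u8 < dump.sql > utf8_withrecode.sql

# Point the dump's client_encoding at UTF8 instead of SQL_ASCII
sed -i "s/SET client_encoding = 'SQL_ASCII';/SET client_encoding = 'UTF8';/" utf8_withrecode.sql

# Load the converted dump into the new database (host, user and db are placeholders)
psql -h newmaster.example.com -U myuser -d mydb -f utf8_withrecode.sql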

Anyway...now for the fun part of doing some load testing against this Postgres cluster!

Tuesday, February 21, 2012

Set operations in Apache Pig


Simple set operation examples

While writing Apache Pig scripts, I realized that in many cases the result I was after was attainable through a series of set operations performed on various relations. It’s not very clear from the documentation how to perform these operations. I googled a bit and found this PDF from Tufts University on ‘Advanced Pig’. In a nutshell, COGROUP is your friend. Here are some simple examples that show how you can perform set operations in Pig using COGROUP.

Let’s assume we have 2 relations TEST1 and TEST2. We load TEST1 from test1.txt containing:


1,aaaa
2,bbbb
3,cccc
4,dddd
5,eeee
6,ffff
7,gggg
8,hhhh
9,iiii

TEST1 = LOAD 's3://mybucket/test1.txt' USING PigStorage(',') as (
id: chararray,
value: chararray);

We load TEST2 from test2.txt containing:

7,ggggggg
8,hhhhhhh
9,iiiiiii
10,jjjjjjj
11,kkkkkkk

TEST2 = LOAD 's3://mybucket/test2.txt' USING PigStorage(',') as (
id: chararray,
value: chararray);


We use COGROUP to generate a new relation. COGROUP is similar to JOIN, in that it operates on one or more fields from each of its input relations. Here is how we cogroup on the id field of TEST1 and the id field of TEST2:

CGRP = COGROUP TEST1 BY id, TEST2 BY id;
DUMP CGRP;

(1,{(1,aaaa)},{})
(2,{(2,bbbb)},{})
(3,{(3,cccc)},{})
(4,{(4,dddd)},{})
(5,{(5,eeee)},{})
(6,{(6,ffff)},{})
(7,{(7,gggg)},{(7,ggggggg)})
(8,{(8,hhhh)},{(8,hhhhhhh)})
(9,{(9,iiii)},{(9,iiiiiii)})
(10,{},{(10,jjjjjjj)})
(11,{},{(11,kkkkkkk)})

If we DESCRIBE the new relation CGRP we get:

CGRP: {group: chararray,TEST1: {(id: chararray,value: chararray)},TEST2: {(id: chararray,value: chararray)}}

What is important to notice is that the second element of each tuple from the new relation is a bag of tuples from TEST1 containing the id value by which we cogrouped, and the third element of each tuple is a bag of tuples from TEST2 containing that same id value. These bags are empty if TEST1 or TEST2 do not contain a given id value. Based on this, we can perform the set operations I mentioned.

To perform set intersection (based on the id field), we only keep those tuples which have non-empty bags for both TEST1 and TEST2:


INTERSECT = FILTER CGRP BY NOT IsEmpty(TEST1) AND NOT IsEmpty(TEST2);
INTERSECT_ID = FOREACH INTERSECT GENERATE group AS id;   
DUMP INTERSECT_ID;

(7)
(8)
(9)


To perform the set difference TEST1 - TEST2 (based again on the id field), we keep only those tuples which have empty bags for TEST2 (which means those particular id values are in TEST1, but not in TEST2):

TEST1_MINUS_TEST2 = FILTER CGRP BY IsEmpty(TEST2);
TEST1_MINUS_TEST2_ID = FOREACH TEST1_MINUS_TEST2 GENERATE group AS id;
DUMP TEST1_MINUS_TEST2_ID;

(1)
(2)
(3)
(4)
(5)
(6)


The difference going the other way (TEST2 - TEST1) is similar. We keep only those tuples which have empty bags for TEST1:

TEST2_MINUS_TEST1 = FILTER CGRP BY IsEmpty(TEST1);
TEST2_MINUS_TEST1_ID = FOREACH TEST2_MINUS_TEST1 GENERATE group AS id;
DUMP TEST2_MINUS_TEST1_ID;

(10)
(11)


Note that if we wanted the set union based on the id field, we could simply generate the ‘group’ element of the CGRP relation:

UNION_ID = FOREACH CGRP GENERATE group AS id;
DUMP UNION_ID;

(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)


To perform the set intersection operation, we could also do a JOIN of TEST1 and TEST2 on the id field:

J = JOIN TEST1 BY id, TEST2 BY id;
DESCRIBE J;
DUMP J;

J: {TEST1::id: chararray,TEST1::value: chararray,TEST2::id: chararray,TEST2::value: chararray}

(7,gggg,7,ggggggg)
(8,hhhh,8,hhhhhhh)
(9,iiii,9,iiiiiii)


After the JOIN, we keep only the first field of the J relation (the id field):

J_ID = FOREACH J GENERATE $0;
DUMP J_ID;

(7)
(8)
(9)


To perform the set union operation, we could do a UNION of TEST1 and TEST2:

U = UNION TEST1, TEST2;
DESCRIBE U;
DUMP U;

U: {id: chararray,value: chararray}

(1,aaaa)
(2,bbbb)
(3,cccc)
(4,dddd)
(5,eeee)
(6,ffff)
(7,gggg)
(8,hhhh)
(9,iiii)
(7,ggggggg)
(8,hhhhhhh)
(9,iiiiiii)
(10,jjjjjjj)
(11,kkkkkkk)


However, note that the tuples containing common id values (7, 8 and 9) are duplicated at this point. So to generate the true set union, we need to keep only the distinct id values:

U_ID = FOREACH U GENERATE $0;
U_ID_DISTINCT = DISTINCT U_ID;
DUMP U_ID_DISTINCT;

(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)

More ‘real life’ set operation examples

The following examples are slightly more realistic. At least they’re based on real data -- the GeoWordNet datasets. As stated in this “Background Knowledge Datasets” document:

“A geo-spatial ontology is an ontology consisting of geo-spatial classes (e.g. lake, city), entities (e.g., Lago di Molveno, Trento), their metadata (e.g. latitude and longitude coordinates) and relations between them (e.g., part-of, instance-of). GeoWordNet is a multilingual geo-spatial ontology built from the full integration of WordNet, GeoNames and the Italian part of MultiWordNet.”

The GeoWordNet dataset contains several CSV files which can be either imported in a relational database, or, in our case, loaded into Pig as relations:

concept = LOAD 's3://mybucket/geowordnet/concept.csv.gz' USING PigStorage(',') as (
 con_id: int,
 name: chararray,
 gloss:chararray,
 lang: chararray,
 provenance: chararray);

relation= LOAD 's3://mybucket/geowordnet/relation.csv.gz' USING PigStorage(',') as (
         src_con_id: int,
         trg_con_id: int,
         name: chararray,
         gloss:chararray,
         lang: chararray);

entity = LOAD 's3://mybucket/geowordnet/entity.csv.gz' USING PigStorage(',') as (
         entity_id: int,
         name: chararray,
         con_id: int,
         lang: chararray,
         latitude: chararray,
         longitude: chararray,
         provenance: chararray);

part_of = LOAD 's3://mybucket/geowordnet/part_of.csv.gz' USING PigStorage(',') as (
         src_entity_id: int,
         trg_entity_id: int);

alternative_name_eng = LOAD 's3://mybucket/geowordnet/alternative_name_eng.csv.gz' USING PigStorage(',') as (
         entity_id: int,
         name: chararray);

alternative_name_ita = LOAD 's3://mybucket/geowordnet/alternative_name_ita.csv.gz' USING PigStorage(',') as (
         entity_id: int,
         name: chararray);


Example 1

-- Find entities with both alternative english AND italian names
COGRP1 = COGROUP alternative_name_eng BY entity_id, alternative_name_ita BY entity_id;
INTERSECT = FILTER COGRP1 BY NOT IsEmpty(alternative_name_eng) AND NOT IsEmpty(alternative_name_ita);
R1 = FOREACH INTERSECT GENERATE FLATTEN(alternative_name_eng), FLATTEN(alternative_name_ita);


Example 2

-- Find entities with alternative english names but with no alternative italian names
COGRP2 = COGROUP alternative_name_eng BY entity_id, alternative_name_ita BY entity_id;
DIFF2 = FILTER COGRP2 BY IsEmpty(alternative_name_ita);
R2 = FOREACH DIFF2 GENERATE FLATTEN(alternative_name_eng);


Example 3

-- Find entities with alternative italian names but with no alternative english names
COGRP3 = COGROUP alternative_name_ita BY entity_id, alternative_name_eng BY entity_id;
DIFF3 = FILTER COGRP3 BY IsEmpty(alternative_name_eng);
R3 = FOREACH DIFF3 GENERATE FLATTEN(alternative_name_ita);

Example 4

-- Find entities with alternative english OR italian names
U = UNION alternative_name_eng, alternative_name_ita;
J = JOIN entity BY entity_id, U BY entity_id;
R4 = FOREACH J GENERATE entity::name, entity::con_id, entity::lang, entity::latitude, entity::longitude, U::name;

Example 5

-- Find entities with NO alternative english and NO italian names (by doing set difference)
COGRP5 = COGROUP entity BY entity_id, U BY entity_id;
DIFF5 = FILTER COGRP5 BY IsEmpty(U);
R5 = FOREACH DIFF5 GENERATE FLATTEN(entity);


Although not strictly set-operation-related, here are some more things you can find out from the GeoWordNet dataset by means of JOINs between the appropriate relations:

-- Find relations between concepts
J1 = JOIN concept BY con_id, relation BY src_con_id;
J2 = JOIN J1 by trg_con_id, concept by con_id;
R6 = FOREACH J2 GENERATE J1::concept::con_id, J1::concept::name, J1::concept::gloss, J1::concept::lang, J1::concept::provenance, J1::relation::src_con_id, J1::relation::trg_con_id, J1::relation::name, J1::relation::gloss, J1::relation::lang, concept::con_id, concept::name, concept::gloss, concept::lang, concept::provenance;

-- Find entities which are part of other entities
J3 = JOIN entity BY entity_id, part_of BY src_entity_id;
J4 = JOIN J3 by trg_entity_id, entity by entity_id;
R7 = FOREACH J4 GENERATE J3::entity::name, J3::entity::con_id, J3::entity::lang, J3::entity::latitude, J3::entity::longitude, 'is part of', entity::name, entity::con_id, entity::lang, entity::latitude, entity::longitude;


Thursday, February 09, 2012

Handling date/time in Apache Pig

A common usage scenario for Apache Pig is to analyze log files. Most log files contain a timestamp of some sort -- hence the need to handle dates and times in your Pig scripts. I'll present here a few techniques you can use.

Mail server logs

The first example I have is a Pig script which analyzes the time it takes for a mail server to send a message. The script is available here as a gist.

We start by registering the piggybank jar and defining the functions we'll need. I ran this using Elastic MapReduce, and all these functions are available in the piggybank that ships with EMR.

REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();             
DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
DEFINE FORMAT org.apache.pig.piggybank.evaluation.string.FORMAT();


Since the mail log timestamps don't contain the year, we declare a variable called YEAR which by default is set to the current year via the Unix 'date' command. The variable can also be set when the Pig script is called by running "pig -p YEAR=2011 mypigscript.pig".

%default YEAR `date +%Y`;

We read in the mail logs and extract the lines containing the source of a given message ('from' lines). An example of such a line:

Dec  2 15:13:52 mailserver1 sendmail[1882]: pB2KCqu1001882: from=<[email protected]>, size=9544, class=0, nrcpts=1, msgid=<[email protected]>, proto=ESMTP, daemon=MTA, relay=relay1.example.com [10.0.20.6]

To split the line into its various elements, we use the EXTRACT function and a complicated regular expression. Note that in Pig the backslash needs to be escaped:

RAW_LOGS = LOAD '$INPUT' as (line:chararray);
SRC = FOREACH RAW_LOGS GENERATE                                                 
FLATTEN(                                                                         
 EXTRACT(line, '(\\S+)\\s+(\\d+)\\s+(\\S+)\\s+(\\S+)\\s+sendmail\\[(\\d+)\\]:\\s+(\\w+):\\s+from=<([^>]+)>,\\s+size=(\\d+),\\s+class=(\\d+),\\s+nrcpts=(\\d+),\\s+msgid=<([^>]+)>.*relay=(\\S+)')
)
AS (
 month: chararray,
 day: chararray,
 time: chararray,
 mailserver: chararray,
 pid: chararray,
 sendmailid: chararray,
 src: chararray,
 size: chararray,
 classnumber: chararray,
 nrcpts: chararray,
 msgid: chararray,
 relay: chararray
);

For this particular exercise we don't need all the fields of the SRC relation. We keep only a few:

T1 = FOREACH SRC GENERATE sendmailid, FORMAT('%s-%s-%s %s', $YEAR, month, day, time) as timestamp;
FILTER_T1 = FILTER T1 BY NOT sendmailid IS NULL;
DUMP FILTER_T1;

Note that we use the FORMAT function to generate a timestamp string out of the month, day and time fields, and we also add the YEAR variable. The FILTER_T1 relation contains tuples such as:

(pB2KDpaN007050,2011-Dec-2 15:13:52)
(pB2KDpaN007054,2011-Dec-2 15:13:53)
(pB2KDru1003569,2011-Dec-2 15:13:54)

We now use the DATE_TIME function, which takes as input our generated timestamp and the date format string representing the timestamp ('yyyy-MMM-d HH:mm:ss'), and returns a DateTime string in ISO 8601 (Joda-Time) format.

R1 = FOREACH FILTER_T1 GENERATE sendmailid, DATE_TIME(timestamp, 'yyyy-MMM-d HH:mm:ss') as dt;
DUMP R1;

The R1 relation contains tuples such as:

(pB2KDpaN007050,2011-12-02T15:13:52.000Z)
(pB2KDpaN007054,2011-12-02T15:13:53.000Z)
(pB2KDru1003569,2011-12-02T15:13:54.000Z)

Note that the timestamp string "2011-Dec-2 15:13:52" got converted into a canonical ISO 8601 DateTime string "2011-12-02T15:13:52.000Z".

Now we can operate on the DateTime strings by using the ISOToUnix function, which takes a DateTime and returns the Unix epoch in milliseconds (which we divide by 1000 to obtain seconds):

-- ISOToUnix returns milliseconds, so we divide by 1000 to get seconds
toEpoch1 = FOREACH R1 GENERATE sendmailid, dt, ISOToUnix(dt) / 1000 as epoch:long;
DUMP toEpoch1;

The toEpoch1 relation contains tuples of the form:

(pB2KDpaN007050,2011-12-02T15:13:52.000Z,1322838832)
(pB2KDpaN007054,2011-12-02T15:13:53.000Z,1322838833)
(pB2KDru1003569,2011-12-02T15:13:54.000Z,1322838834)

We now perform similar operations on lines containing destination email addresses:

DEST = FOREACH RAW_LOGS GENERATE                                                 
FLATTEN(                                                                         
 EXTRACT(line, '(\\S+)\\s+(\\d+)\\s+(\\S+)\\s+(\\S+)\\s+sendmail\\[(\\d+)\\]:\\s+(\\w+):\\s+to=<([^>]+)>,\\s+delay=([^,]+),\\s+xdelay=([^,]+),.*relay=(\\S+)\\s+\\[\\S+\\],\\s+dsn=\\S+,\\s+stat=(.*)')
)
AS (
 month: chararray,
 day: chararray,
 time: chararray,
 mailserver: chararray,
 pid: chararray,
 sendmailid: chararray,
 dest: chararray,
 delay: chararray,
 xdelay: chararray,
 relay: chararray,
 stat: chararray
);


T2 = FOREACH DEST GENERATE sendmailid, FORMAT('%s-%s-%s %s', $YEAR, month, day, time) as timestamp, dest, stat;
FILTER_T2 = FILTER T2 BY NOT sendmailid IS NULL;

R2 = FOREACH FILTER_T2 GENERATE sendmailid, DATE_TIME(timestamp, 'yyyy-MMM-d HH:mm:ss') as dt, dest, stat;

-- ISOToUnix returns milliseconds, so we divide by 1000 to get seconds
toEpoch2 = FOREACH R2 GENERATE sendmailid, dt, ISOToUnix(dt) / 1000 AS epoch:long, dest, stat;

At this point we have 2 relations, toEpoch1 and toEpoch2, which we can join by sendmailid:

R3 = JOIN toEpoch1 BY sendmailid, toEpoch2 BY sendmailid;

The relation R3 will contain tuples of the form

(sendmailid, datetime1, epoch1, sendmailid, datetime2, epoch2, dest, stat)

We generate another relation by keeping the sendmailid, the delta epoch2 - epoch1, the destination email and the status of the delivery. We also order by the epoch delta:

R4 = FOREACH R3 GENERATE $0, $5 - $2, $6, $7;
R5 = ORDER R4 BY $1 DESC;

R5 contains tuples such as:

(pB2KDqo5007488,2,[email protected],Sent (1rwzuwyl3Nl36v0 Message accepted for delivery))
(pB2KDru1003560,1,[email protected],Sent (ok dirdel))
(pB2KCrvm030964,0,[email protected],Sent ( <201112022012.pB2KCrvm030964> Queued mail for delivery))

At this point we can see which email deliveries took longest, and try to identify patterns (maybe certain mail domains make it harder to deliver messages, or maybe email addresses are misspelled, etc).

Nginx logs

In the second example, I'll show how to do some date conversions on Nginx access log timestamps. The full Pig script is available here as a gist.

We parse the Nginx access log lines similarly to the mail log lines in the first example:

RAW_LOGS = LOAD '$INPUT' as (line:chararray);
LOGS_BASE = FOREACH RAW_LOGS GENERATE                                            
FLATTEN(                                                                         
 EXTRACT(line, '(\\S+) - - \\[([^\\[]+)\\]\\s+"([^"]+)"\\s+(\\d+)\\s+(\\d+)\\s+"([^"]+)"\\s+"([^"]+)"\\s+"([^"]+)"\\s+(\\S+)')
)
AS (
 ip: chararray,
 timestamp: chararray,
 url: chararray,
 status: chararray,
 bytes: chararray,
 referrer: chararray,
 useragent: chararray,
 xfwd: chararray,
 reqtime: chararray
);
DATE_URL = FOREACH LOGS_BASE GENERATE timestamp;
F = FILTER DATE_URL BY NOT timestamp IS NULL;

The timestamp is of the form "30/Sep/2011:00:10:02 -0700" so we use the appropriate DATE_TIME formatting string 'dd/MMM/yyyy:HH:mm:ss Z' to convert it to an ISO DateTime. Note that we need to specify the timezone with Z:

R1 = FOREACH F GENERATE timestamp, DATE_TIME(timestamp, 'dd/MMM/yyyy:HH:mm:ss Z') as dt;
DUMP R1;

R1 contains tuples of the form:

(30/Sep/2011:00:19:35 -0700,2011-09-30T00:19:35.000-07:00)
(30/Sep/2011:00:19:36 -0700,2011-09-30T00:19:36.000-07:00)
(30/Sep/2011:00:19:37 -0700,2011-09-30T00:19:37.000-07:00)

At this point, if we wanted to convert from DateTime to Unix epoch in seconds, we could use ISOToUnix like we did for the mail logs:

toEpoch = FOREACH R1 GENERATE dt, ISOToUnix(dt) / 1000 as epoch:long;

However, let's use another function called FORMAT_DT to convert from the above DateTime format to another format of the type 'MM/dd/yyyy HH:mm:ss Z'. The first argument to FORMAT_DT is the desired format for the date/time, and the second argument is the original DateTime format:

FDT = FOREACH R1 GENERATE FORMAT_DT('MM/dd/yyyy HH:mm:ss Z', dt) as fdt;
DUMP FDT;

The FDT relation now contains tuples such as:

(09/30/2011 00:19:35 -0700)
(09/30/2011 00:19:36 -0700)
(09/30/2011 00:19:37 -0700)

We can now use a handy function called CustomFormatToISO to convert from any custom date/time format (such as the one we generated in FDT) back to a canonical ISO DateTime format:

toISO = FOREACH FDT GENERATE fdt, CustomFormatToISO(fdt, 'MM/dd/yyyy HH:mm:ss Z');
DUMP toISO;

(09/30/2011 00:19:35 -0700,2011-09-30T07:19:35.000Z)
(09/30/2011 00:19:36 -0700,2011-09-30T07:19:36.000Z)
(09/30/2011 00:19:37 -0700,2011-09-30T07:19:37.000Z)

Note how the custom DateTime string "09/30/2011 00:19:35 -0700" got transformed into the canonical ISO DateTime string "2011-09-30T07:19:35.000Z".

Converting Unix epoch to DateTime

Some log files have timestamps in Unix epoch format. If you want to transform them into DateTime, you can use the UnixToISO function:

DEFINE UnixToISO org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO();

Here is an input file:

$ cat unixtime.txt
1320777563
1320777763
1320779563
1320787563

And here is a Pig script which converts the epoch into DateTime strings. Note that UnixToISO expects the epoch in milliseconds, and our input is in seconds, so we have to multiply each input value by 1000 to get to milliseconds:

UNIXTIMES = LOAD 's3://mybucket.com/unixtime.txt' as (unixtime:long);
D = FOREACH UNIXTIMES GENERATE UnixToISO(unixtime * 1000);
DUMP D;

(2011-11-08T18:39:23.000Z)
(2011-11-08T18:42:43.000Z)
(2011-11-08T19:12:43.000Z)
(2011-11-08T21:26:03.000Z)

Tuesday, January 24, 2012

An ode to running a database on bare metal


No, my muse is not quite as strong as to inspire me to write an ode, but I still want to emphasize a few points about the goodness of running a database on bare metal.

At Evite, we use sharded MySQL for our production database. We designed the current architecture in 2009, when NoSQL was still very much in its infancy, so MySQL seemed a solid choice, a technology that we could at least understand. As I explained elsewhere, we do use MySQL in an almost non-relational way, and we sharded from the get-go, with the idea that it's better to scale horizontally than vertically.

We initially launched with the database hosted at a data center on a few Dell PE2970 servers, each with 16 GB of RAM and 2 quad-core CPUs. Each server was running 2 MySQL instances. We didn't get a chance to dark launch, but the initial load testing we did showed that we should be OK. However, there is nothing like production traffic to really stress test your infrastructure, and we soon realized that we had an insufficient number of servers for the peak traffic we were expecting towards the end of the year.

We decided to scale horizontally in EC2, with one MySQL instance per m1.xlarge EC2 instance. At the time we also engaged Percona and they helped us fine-tune our Percona XtraDB MySQL configuration so we could get the most out of the m1.xlarge horsepower. We managed to scale sufficiently for our high season in 2010, although we had plenty of pain points. We chose to use EBS volumes for our database files, because at the time EBS still gave people the illusion of stability and durability. We were very soon confronted with severe performance issues, manifested as very high CPU I/O wait times, which were sometimes so high as to make the instance useless.

I described in a previous post how proficient we became at failing over from a master that went AWOL to a slave. Our issues with EBS volumes were compounded by the fact that our database access pattern is very write-intensive, and a shared medium such as EBS was far from ideal. Our devops team was constantly on the alert, and it seemed like we were always rebuilding instances and recovering from EC2 instance failures, although the end-user experience was not affected.

Long story short, we decided to bring the database back in-house, at the data center, on 'real' bare-metal servers. No virtualization, thanks. The whole process went relatively smoothly. One important point I want to make here is that we already had a year's worth of hard numbers at that point regarding the access patterns to our database, iops/sec, MySQL query types, etc. That made it easy to do proper capacity planning this time, in the presence of production traffic.

We started by buying 2 Dell C2100 servers, monster machines, with dual Intel Xeon X5650 processors (for a total of 24 cores), 144 GB RAM, and 12 x 1 TB hard disks, out of which we prepared a 6 TB RAID 10 volume which we further divided into LVM logical volumes for specific types of MySQL files.

We put 2 MySQL instances on each server, and we engaged Percona again to help us fine-tune the configuration, this time including not only MySQL, but also the hardware and the OS. They were super helpful to us, as usual. Here are only some of the things they recommended, which we implemented:
  • set vm.swappiness kernel setting to 0 in /etc/sysctl.conf
  • set InnoDB flush method to O_DIRECT because we can rely on the RAID controller to do the caching (we also mounted XFS with the nobarrier option in conjunction with this change)
  • disable MySQL query cache, which uses a global mutex that can cause performance issues when used on a multi-core server
  • various other optimizations which were dependent on our setup, things like tweaking MySQL configuration options such as key_buffer_size and innodb_io_capacity
One important MySQL configuration option that we had to tweak was innodb_buffer_pool_size. If we set it too high, the server could start swapping. If we set it too low, disk I/O on the server could become problematic. Since we had 144 GB of RAM and we were running 2 MySQL instances per server, we decided to give each instance 60 GB of RAM. This proved to strike a good balance.
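For the record, the OS-level items in the list above boil down to a couple of commands. A minimal sketch -- the mount point is a placeholder for wherever your MySQL data lives, and disabling barriers is only safe because the RAID controller has a battery-backed cache:

# Make the swappiness change persistent and apply it immediately
echo "vm.swappiness = 0" >> /etc/sysctl.conf
sysctl -w vm.swappiness=0

# Remount the XFS volume holding the MySQL data with write barriers disabled
mount -o remount,nobarrier /var/lib/mysql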

Once the fine-tuning was done, we directed production traffic away from 4 EC2 m1.xlarge instances to 2 x 2 MySQL instances, with each pair running on a C2100. We then sat back and wallowed for a while in the goodness of the I/O numbers we were observing. Basically, the servers were barely working. This is how life should be. 

We soon migrated all of our MySQL masters back into the data center. We left the slaves running in EC2 (still one m1.xlarge slave per MySQL master instance), but we changed them from being EBS-backed to using the local ephemeral disk in RAID 0 with LVM. We look at EC2 in this case as a secondary data center, used only in emergency situations.

One thing that bit us in our bare-metal setup was... a bare-metal issue around the LSI MegaRAID controllers. I already blogged about the problems we had with the battery relearning cycle, and with decreased performance in the presence of bad drives. But these things were easy to fix (again thanks to our friends at Percona for diagnosing these issues correctly in the first place...)

I am happy to report that we went through our 2011 high season without a glitch in this setup. Our devops team slept much better at night too! One nice thing about having EC2 as a 'secondary data center' is that, if need be, we can scale out horizontally by launching more EC2 instances. In fact, we doubled the number of MySQL slave instances for the duration of our high season, with the thought that if we needed to, we could double the number of shards at the application layer and thus scale horizontally that way. Fortunately we didn't have to, but we could have -- a strategy that would be hard to pull off without a cloud presence, unless we bought a lot of extra capacity at the data center.

This brings me to one of the points I want to make in this post: it is a very valuable strategy to be able to use the cloud to roll out a new architecture (which you designed from the get-go to be horizontally scalable) and to gauge its performance in the presence of real production traffic. You will get less than optimal performance per instance (because of virtualization vs. real hardware), but since you can scale horizontally, you should be able to sustain the desired level of traffic for your application. You will get hard numbers that will help you do capacity planning and you will be able to bring the database infrastructure back to real hardware if you so wish, like we did. Note that Zynga has a similar strategy -- they roll out new games in EC2 and once they get a handle on how much traffic a game has, they bring it back into the data center (although it looks like they still use a private cloud and not bare metal).

Another point I want to make is that the cloud is not ready yet for write-intensive transactional databases, mainly because of the very poor I/O performance that you get on virtual instances in the cloud (compounded by shared network storage such as EBS). Adrian Cockcroft will reply that Netflix is doing just fine and they're exclusively in EC2. I hope they are doing just fine, and I hope his devops team is getting some good sleep at night, but I'm not sure. I need to perhaps qualify my point and say that the cloud is not ready for traditional transactional databases such as MySQL and PostgreSQL, which require manual sharding to be horizontally scalable. If I had to look at redesigning our database architecture today, I'd definitely try out HBase, Riak and maybe Cassandra. The promise there at least is that adding a new node to the cluster in these technologies is much less painful than in the manual sharding and scaling scenario. This still doesn't guarantee that you won't end up paying for a lot of instances to compensate for poor individual I/O per instance. Maybe a cloud vendor like Joyent with their SmartMachines will make a difference in this area (in fact, it is on our TODO list to test out their Percona SmartMachine).

Note however that there's something to be said about using good ol' RDBMS technologies. Ryan Mack says this in a Facebook Engineering post:

"After a few discussions we decided to build on four of our core technologies: MySQL/InnoDB for storage and replication, Multifeed (the technology that powers News Feed) for ranking, Thrift for communications, and memcached for caching. We chose well-understood technologies so we could better predict capacity needs and rely on our existing monitoring and operational tool kits."

The emphasis on the last sentence is mine. It's the operational aspect of a new architecture that will kill you first. With a well understood architecture, at least you have a chance to tame it.

Yet another point I'd like to make is: do not base your disaster recovery strategy in EC2 around EBS volumes, especially if you have a write-intensive database. It's not worth the performance loss, and most of all it's not worth the severe and unpredictable fluctuation in performance. It works much better in our experience to turn the ephemeral disks of an m1.xlarge EC2 instance into a RAID 0 array and put LVM on top of that, and use it for storing the various MySQL file types. We are then able to do LVM snapshots of that volume, and upload the snapshots to S3. To build a new slave, we can restore the snapshot from S3, then catch up the replication with the master. Works fine.
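As an illustration of that backup flow, here is a hedged sketch. The volume group, logical volume and bucket names are placeholders, s3cmd stands in for whichever S3 upload tool you prefer, and in practice you would also quiesce MySQL (or at least record the replication coordinates) before taking the snapshot:

#!/bin/bash
# Snapshot the LVM volume holding the MySQL data on an EC2 slave
# and ship a compressed copy to S3. All names below are placeholders.
VG=vg_mysql
LV=lv_data
SNAP=mysql_snap_$(date +%Y%m%d)

# Create a copy-on-write snapshot of the MySQL data volume
lvcreate --snapshot --size 10G --name $SNAP /dev/$VG/$LV

# Mount the snapshot read-only and archive it (add nouuid if the filesystem is XFS)
mkdir -p /mnt/$SNAP
mount -o ro /dev/$VG/$SNAP /mnt/$SNAP
tar czf /tmp/$SNAP.tar.gz -C /mnt/$SNAP .

# Upload the archive to S3, then clean up
s3cmd put /tmp/$SNAP.tar.gz s3://mybucket/mysql-snapshots/
umount /mnt/$SNAP
lvremove -f /dev/$VG/$SNAP
rm /tmp/$SNAP.tar.gz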

There you have it. An ode in prose to running your database on bare metal. Try it, you may sleep better at night!
