
Tuesday, November 24, 2015

(Really) Big Data: from the trenches

Okay, people throw around 'big data' experience, but what is it really like? What does it feel like to manage a petabyte of data? How do you get your hands around it? What is the magic formula that makes it all work seamlessly, without bridge lines opened on Easter Sunday with three Vice Presidents on the line asking for status updates by the minute during an application outage?

Are you getting the feel for big data yet?

Nope.

Big data is not terabytes. 'Normal' SQL databases like Oracle or DB2 or GreenPlum or whatever can manage those, and big data vendors don't have a qualm about handling your 'big data' of two terabytes, even as they scoff into your purchase order.

"I've got a huge data problem of x terabytes."

No, you don't. You think you do, but you can manage your data just fine and not even make Hadoop hiccough.

Now let's talk about big data.

1.7 petabytes
2.5 billion transactions per day.
Oh, and growing to SIX BILLION transactions per day.

This is my experience. When the vendor has to write a new version of HBase because their version that could handle 'any size of data, no matter how big' crashed when we hit 600 TB?

Yeah. Big data.

So, what's it like?

Storage Requirements/Cluster Sizing


1. Your data is bigger than you think it is/bigger than the server farm you planned for it.

Oh, and 0. first.

0. You have a million-dollar (USD) budget ... per month.

Are you still here? Because that's the kind of money you have to lay out for the transactional requirements and storage requirements you're going to need.

Get that lettuce out.

So, back to 1.

You have this formula, right? From the vendor, that says: elastic replication is at 2.4, so for 600 TB you need roughly 1.4 petabytes of space.

Wrong.

Wrong. Wrong. WRONG.

First: throw out the vendors' formulae. They work GREAT for small data in the lab. They suck for big data IRL.

Here's what happens in industry.

You need a backup. You make a backup. A backup is the exact same size as your active HTables, because the HTables are already compressed (bz2-format), so the backup compresses no further.

Double the size of your cluster for that backup-operation.

Not a problem. You shunt that TWO PETABYTE BACKUP TO AWS S3?!?!?

Do you know how long that takes?

26 hours.

Do you know how long it takes to do a restore from backup?

Well, boss, we have to load the backup from S3. That will take 26 hours, then we ...

Boss: No.

Me: What?

Boss: No. DR ('disaster recovery') requires an immediate switch-over.

Me: Well, the only way to do that is to keep the backup local.

Boss: Okay.

Double the size of your cluster, right?

Nope.

What happens if the most recent backup, that is, today's backup, is corrupted? (You're backing up every day, just before the ETL run and again right after it, because you CANNOT have data corruption here, people. You just can't.)

You have to go to the previous backup.

So now you have two FULL HTable backups locally on your 60-node cluster!

And all the other backups are shunted, month-by-month, to AWS S3.

Do you know how much 2 petabytes, then 4 petabytes, then 6 petabytes in AWS S3 costs ... per month?

So, what to do then?

You shunt the 'old' backups, older than x years old, every month, to Glacier.

Yeah, baby.

That's the first thing: your cluster needs to be three times the size of what you think it needs to be, or else you're dead in one month. Personal experience bears this out. First, you need the wiggle room, or else you stress out the poor nodes of your poor cluster, and you start getting HBase warnings, then critical error messages, about space utilization. Second, you need that extra space when the ETL job loads in a billion-row transaction out of the 2.5 billion transactions you're loading in that day.

Been there. Done that.
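To make that concrete, here is a back-of-the-envelope sizing sketch in Haskell. The numbers and the headroom factor are illustrative assumptions of mine, not the vendor's formula and not the actual cluster spec:

rawTB, replication, localCopies, etlHeadroom :: Double
rawTB       = 600    -- active HTable data, in terabytes (illustrative)
replication = 2.4    -- the replication factor the vendor quotes
localCopies = 3      -- active data plus two full, already-compressed local backups
etlHeadroom = 1.25   -- slack so the daily billion-row ETL load doesn't redline the disks

clusterTB :: Double
clusterTB = rawTB * replication * localCopies * etlHeadroom
-- ≈ 5400 TB, versus the naive rawTB * replication ≈ 1440 TB: well over three times bigger.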

Disaster Recovery


Okay, what about that DR, that Disaster Recovery?

Your 60-node cluster goes down. Because, first, you're not an idiot: you didn't build a data center and rack all those computers yourself; you shunted all that to Amazon and let them handle that maintenance nightmare.

Then the VP of AWS Oregon region contacts you and tells you everything's going down in that region: security patch. No exceptions.

You had a 24/7 contract with 99.999% availability with them.

Sorry, Charlie: you're going down. A hard shutdown. On Thursday.

What are you going to do?

First, you're lucky if Amazon tells you: they usually just do it and let you figure that out on your own. So that means you have to be ready, at any time, for the cluster to go down for no apparent reason and with no warning.

We had two separate teams monitoring our cluster: 24/7. And they opened that Bridge Line the second a critical warning fired.

And if a user called in and said the application was non-responsive?

Ooh, ouch. God help you. You have not seen panic in ops until you see it when one user calls in and you come to find it's because the cluster is down and no alert caught it.

Set up monitoring systems on your cluster. No joke.

With big data, your life? Over.

Throughput


Not an issue. Or, rather, it becomes an issue when you're shunting your backup to S3 and the cluster gets really slow. We rolled out to 1,600 users, and we stress-tested it, you know. Nobody had problems during normal operations; it's just that when you ask the cluster to do something like an ETL run or a backup transfer, it engages every disk of every node in reads and writes.

A user request hits all your region servers, too.

Do your backups at 2 am or on the weekends. Do your ETL after 10 pm. We learned to do that.

Maintenance


Amazon is perfect; Amazon is wonderful; you'll never have to maintain or monitor your cluster again! It's all push-of-the-button.

I will give Amazon this: we used to have in-house clusters with in-house teams monitoring them, 'round the clock. Amazon reduced maintenance to this: "Please replace this node."

Amazon: "Done."

But you can't ask anything other than that. Your data on that node? Gone. That's it, no negotiations. But Hadoop/HBase takes care of that for you, right? So you're good, right?

Just make sure you have your backup/backout/DR plans in place and tested with real, honest-to-God we're-restarting-the-cluster-from-this-backup data or else you'll never know until you're in hot water.

Vendors


Every vendor will promise you the Moon ... and 'we can do that.' Every vendor believes it.

Then you find out what's what. We did. Multiple times, multiple vendors. Most couldn't handle our big data when push came to shove, even though they promised they could handle data of any size. They couldn't. Or they couldn't handle it in a manageable way: if the ETL process takes 26 hours and it's daily, you're screwed. Our ETL process got down to 1.5 hours, but that was after some tuning on their part and on ours: we had four consultants from the vendor in-house every day for a year running. Part of our contract-agreement. If you are blazing the big data trail, your vendor is, too: we were inventing stuff on the fly just to manage the data coming in, and to ensure the data came out in quick, responsive ways.

You're going to have to do that, too, with real big data, and that costs money. Lots.

And it also costs cutting through what vendors are saying to you versus what their product can actually handle. Their sales people have their sales pitch, but what really happened is that we had to go through three revisions of their product just so it could be a Hadoop/HBase-compliant database that could handle 1.7 petabytes of data.

That's all.

Oh, and grow by 2.5 billion rows per day.

Which leads to ...

Backout/Aging Data


Look, you have big data. Some of it's relevant today, some of it isn't. You have to separate the two, clearly and daily. If you don't, then a month, two months, two years down the road you're screwed, because you're now dealing with a full-to-the-gills cluster AND having to disambiguate data you've entangled (haven't you?) with the promise of looking at aging data gracefully ... 'later.'

Well, later is right now, and your cluster is full and in one month it's going critical.

What are you going to do?

Have a plan to age data. Have a plan to version data. Have a data-correction plan.

These things can't keep being pushed off to 'later,' because 'later' will be far too late, and you'll end up crashing your cluster (bad) or, come to find, corrupting your data when you slice and dice it the wrong way (much, much worse). Oh, and version your backups, tying them to the application version, because when you upgrade your application, your old data gets all screwy, or the new data format on your old application gets all screwy when somebody pulls up a special request to view three-year-old data.

Have a very clear picture of what your users need, the vast majority of the time, and deliver that and no more.

We turned a 4+ hour query on GreenPlum that terminated when it couldn't deliver a 200k+ row result...

Get that? 4+ hours to learn your query failed.

No soup for you.

...into a 10-second query against Hadoop HBase that returns 1M+ rows.

Got that?

We changed peoples' lives. What was impossible before for our 1600 users was now in hand in 10 seconds.

But why?

Because we studied all their queries.

One particular query was issued 85% of the time.

We built our Hadoop/HBase application around that, and shunted the other 15% of the queries to other tools that could manage that load.

Also, we studied our users: all their queries were against transactions from within the last month.

We kept two years of data on-hand.

Stupid.

And that two years grew to more, month by month.

Stupider.

We had no graceful data aging/versioning/correcting plans, so, 18 months into production we were faced with a growing problem.

Growing daily.

The users do queries going back up to a month? No problem: here's your data in less than 10 seconds, guaranteed. You want to do research? You put in a request.

Your management has to put their foot down. They have to be very clear what this new-fangled application is delivering and the boundaries on what data they get.

Our management did, for the queries, and our users loved us. Going from 'you put in a query, it takes four hours, and only 16 queries are allowed to run against the system at any one time' to 'anyone, anywhere can submit a query and it returns right away'?

Life-changing, and we did psychological studies as well as user-experience studies, too, so I'm not exaggerating.

What our management did not do is put bounds on how far back you could go into the data set. The old application had a 5 year history, so we thought two years was good. It wasn't. Everybody only queried on today, or yesterday, or, rarely: last week or two weeks ago. We should have said: one month of data. You want more, submit a request to defrost that old stuff. We didn't and we paid for it in long, long meetings around the problem of how to separate old data from new and what to do to restore old data, if, ever (never?) a request for old data came. If we had a monthly shunt to S3 then to Glacier, that would have been a well-understood and automatic right-sizing from the get-go.

You do that for your big data set.

Last Words


Look. There's no cookbook or "Big Data for Dummies" that is going to give you all the right answers. We had to crawl through three vendors to get to one who didn't work out of the box but who could at least work with us, night and day, to get to a solution that could eventually work with our data set. So you don't have to do that. We did that for you.

You're welcome.

But you may have to do that because you're using Brand Y not our Brand X or you're using Graph databases, not Hadoop, or you're using HIVE or you're using ... whatever. Vendors think they've seen it all, and then they encounter your data-set with its own particular quirks.

Maybe, or maybe it all will magically just work for you.

And let's say it does all magically work, and let's say you've got your ETL tuned, and your HTables properly structured for fast in-and-out operations.

Then there's the day-to-day daily grind of keeping a cluster up and running. If your cluster is in-house ... good luck with that. Have your will made out and ready for when you die from stress and lack of sleep. If your cluster is from an external vendor, just be ready for the ... eh ... quarterly, at least, ... times they pull the rug out from under you, sometimes without telling you and sometimes without reasonably fair warning time, so it's nights and weekends for you to prep with all hands on deck and everybody looking at you for answers.

Then, ... what next?

Well: you have big data? It's because you have Big Bureaucracy. The two go together, invariably. That means your Big Data team is telling you they're upgrading from HBase 0.94 to HBase whatever, and that means all your data can go bye-bye. What's your transition plan? We're phasing in that change next month.

And then somebody inserts a row in the transaction, and it's ... wrong.

How do you tease a transaction out of an HTable and correct it?

An UPDATE SQL statement?

Hahaha! Good joke! You so funny!

Tweep: "I wish twitter had an edit function."

Me: Hahaha! You so funny!

And, ooh! Parallelism! We had, count 'em, three thousand region servers for our MapReduce jobs. You got your hands around parallelism? Optimizing MapReduce? Monitoring the cluster as the next 2.5 billion rows are processed by your ETL job?

And then a disk goes bad, at least once a week? Stop the job? Of course not. Replace the disk (which means replacing the entire node because it's AWS) during the op? What are the impacts of that? Do you know? What if two disks go down during an op?

Do you know what that means?

At a replication factor of 2.4, two bad disks mean that one more disk going bad gives you a real possibility of data corruption.

How are your backups doing? Are they doing okay? Because if they're on the cluster, now your backups are corrupted, too. Have you thought of that?

Think about that.

And, I think I've given enough experience-from-the-trenches for you to think on when spec'ing out your own big data cluster. Go do that and (re)discover these problems and come up with a whole host of fires you have to put out on your own, too.

Hope this helped. Share and enjoy.

cheers, geophf

Thursday, August 13, 2015

Uploading Data to GrapheneDB

Today.

We look at how to upload a set of data (potentially massive) to GrapheneDB. I say '(potentially massive)' because an option, of course, is to enter Cypher, line-by-line in the Neo4J web admin interface, but this becomes onerous when there are larger data sets with complex (or, strike that, even 'simple') relations.

An ETL-as-copy-paste is not a solution for the long term, no matter how you slice it (trans: no matter for how long you have that intern).

So, let's look at a viable long-term solution using a specific example.

Let's do this.

The Data

The data is actually a problem in and of itself, as it is the set of Top 5 securities by group, and it is reported by various outlets, but the reports are (deeply) embedded into ((very) messy) HTML, in the main, or have a nice, little fee attached to them if you want to tap into a direct feed.

As I'm a start-up, I have more time than money (not always true for all start-ups, but that's a good rule of thumb for this one), so, instead of buying a subscription to the top 5s-feed, I built an HTML-scraper in Haskell to download the sets of Top 5s. Scraping HTML is not in the scope of this article, so if you wish to delve into that topic, please review Tagsoup in all its glory.

Okay, prerequisite,

Step 0. scraped data from HTML: done (without explanation. Deal.)

Next, I save the data locally. I suppose I could go into a database instance, such as MySQL, but for now, I have only 50 or so days worth of data, which I'm happily managing in a file and a little map in memory.

Step 1. store data we care about locally: done

Okay, so we have scraped data, automatically stored away for us. What does it all mean? That's when I got the idea of having a way to visualize and query these data. Neo4J was a fit, and GrapheneDB, being DaaS (you just need to know that 'DaaS' means 'Data as a Service'), makes sense for collaborating as a geographically-dispersed team.

Two Options

So, how do I get the data there? We explored two options. One was to load the data into a local Neo4J instance and then snap-restore in the Cloud from that image. That works the first time, but it seems rather ponderous: this is a daily process, and I do not wish to replicate my database locally every day and snap-restore to the Cloud. So I chose the other option, which was to build a system that takes the local map, translates it into Cypher queries (to save as graph nodes and edges), translates those Cypher queries into JSON, and then has a web client ship that JSON-of-Cypher-queries over the wire to the targeted web service.

... Neo4J and GrapheneDB are web services that allow REST data queries... (very) helpful, that.

Step 2. Translate the local data to Cypher queries

Okay, this is not going to be a Cypher tutorial. And this is not going to be the Cypher you like. I have my Cypher, you have yours, critique away. Whatevs. Let's get on with it.

The data is of the following structure, as you saw above:

Date -> { ("Mkt_Cap", ([highs], [lows])), ("Price", ([highs], [lows])), ("Volume", [leaders]) }
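In Haskell terms, a minimal sketch of that shape (type and constructor names here are illustrative, not the actual module's) might be:

import Data.Map (Map)
import Data.Time.Calendar (Day)

type Symbol = String

data Grouping = MktCap ([Symbol], [Symbol])   -- (highs, lows)
              | Price  ([Symbol], [Symbol])   -- (highs, lows)
              | Volume [Symbol]               -- leaders only

type Top5s = Map Day [Grouping]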

And we wish to follow this structure in our graph-relational data. Let's look at the top-tier of the data-structure:



You see I've also added a 'Year'-node. I do this so I can anchor around a locus of days if I wish to explore data across a set of days.

So, okay, from there, do I then create grouping nodes of 'leaders' and 'losers' for the categorization of stocks? This gets into the data-modelling question. I chose to label the relations to the stocks as such instead of creating grouping nodes. There're tradeoffs in these modeling decisions, but I'm happy with the result:



The module that converts the internal data structures is named Analytics.Trading.Web.Upload.Cypher. Looking at that module you see it's very MERGE-centric. Why? Here's why:


What you see here is that symbols, such as, well, primarily $AAPL, and others like $T and $INTC, find themselves on the Top 5s lists over and over again. By using MERGE, we make sure the symbol is created if this is its first reference, or linked-to if we've seen it before in this data set.

In this domain, MERGE and I are like this: very close friends.
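As an illustration only (this is not the actual Analytics.Trading.Web.Upload.Cypher module), a function emitting the three MERGE statements for one symbol on one day's list might look like this, with node labels and relationship names invented for the example:

mergeSymbol :: String -> String -> String -> String
mergeSymbol day relation symbol = unlines
  [ "MERGE (s:SYMBOL { name: '" ++ symbol ++ "' })"   -- create the symbol node, or reuse it
  , "MERGE (d:DAY { date: '" ++ day ++ "' })"         -- likewise for the trading day
  , "MERGE (d)-[:" ++ relation ++ "]->(s)"            -- e.g. 'MKT_CAP_LEADER', 'PRICE_LOSER', ...
  ]

MERGE-ing the symbol node first is what gives the behavior described above: $AAPL gets created once and is linked-to every time thereafter.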

Okay, Map-to-Cypher, it's a very close ... well, mapping, because the relational calculus and domain-to-co-domain-mappings have a high correspondence.

I'm at a 'disadvantage' here: I come to functional programming from a Prolog background, so I think of functional data structures relationally, and, usually, mappings of my functional data structures fall very nicely into graphs.

I don't know how to help you with your data structures, especially if you've been doing the Java/SQL object/relation-mapping stuff using JPA ... I mean, other than saying: 'Switch to ... Haskell?' Sorry.

Okay, so we have a set of Cypher queries, grouped in the following structures:

Date -> [groups] where groups are Mkt_Cap, Volume, and Price

Then, for each group for that date

group -> Leader [symbols] -> Losers [symbol]

So we have (with the three groups) four sets of Cypher queries, each grouped Cypher query weighing in at thirty MERGE statements (three MERGE statements for each stock symbol node). Not bad.

How do we convert this list of grouped Cypher queries into JSON that Neo4J understands?

Two things make this very easy.

  1. The JSON-structure that Neo4J accepts is very simple: it is simply a group of "statements", individuated into single Cypher-"statement" elements. Very simple JSON! (Thank you, Neo4J!)
  2. There is a module in Haskell, Data.Aeson, that facilitates converting from data structures in Haskell into JSON-structure, so the actual code to convert the Cypher queries reduces to three definitions:
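The original definitions aren't reproduced here, but a minimal sketch of the idea, using Data.Aeson and the "statements"/"statement" envelope mentioned above (the type names are mine), looks something like:

{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (ToJSON (toJSON), object, (.=))
import Data.Text (Text)

newtype Statement  = Statement Text             -- one Cypher query
newtype Statements = Statements [Statement]     -- a batch of them

instance ToJSON Statement where
  toJSON (Statement cypher) = object ["statement" .= cypher]

instance ToJSON Statements where
  toJSON (Statements stmts) = object ["statements" .= stmts]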

With that, we have Cypher queries packaged up in JSON.

Step 3: SHIP IT!
So now that we have our data set, converted to Cypher and packaged as JSON, we want to send it to GrapheneDB. Before I went right to that database (I could have, but I didn't), I tested my results on a Neo4J instance running on my laptop, ran the REST call, and verified the results. BAM! It worked for the one day I uploaded.
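A sketch of the shipping step (not the original client; the URL, credentials, and endpoint path are assumptions on my part, the path being Neo4J 2.x's transactional Cypher endpoint), using the wreq library:

{-# LANGUAGE OverloadedStrings #-}

import Control.Lens ((&), (?~))
import Data.Aeson (Value)
import Network.Wreq (auth, basicAuth, defaults, postWith)

uploadCypher :: String -> Value -> IO ()
uploadCypher baseUrl payload = do
  -- GrapheneDB connections are authenticated; a local instance may not need this
  let opts = defaults & auth ?~ basicAuth "user" "password"
  _ <- postWith opts (baseUrl ++ "/db/data/transaction/commit") payload
  return ()

Here baseUrl is whichever instance you're targeting, and payload is the JSON envelope from the previous step.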


After I got that feel-good confirmation in the small, I simply switched the URL from localhost:7474 to the URL GrapheneDB provides in the "Connection" tab, and voilà: we have data here!


(lots of it!)

Step n: Every day
So now that I have the back-data uploaded, I simply run my scraper->ETL-over-REST->GrapheneDB little system and I have up-to-the-day Top 5s stock securities for my analysis, on the Cloud.

LOLSweet!

Monday, June 29, 2015

Tabular and Visual Representations of Data using Neo4J

Corporate and Employee Relationships
Both Graphical and Tabular Results

So, there are many ways to view data, and people may have different needs for representing that data, either for visualization (in a graph:node-edges-view) or for tabulation/sorting (in your standard spreadsheet view).

So, can Neo4J cater to both these needs?

Yes, it can.

Scenario 1: Relationships of owners of multiple companies

Let's say I'm doing some data exploration, and I wish to know who has interest or ownership in multiple companies. Why? Well, let's say I'm interested in the Peter-Paul problem: I want to know whether Joe, who owns company X, is paying company Y in some artificial scheme to inflate or deflate the numbers of either business and thereby profit illegally.

Piece of cake. Neo4J, please show me the owners, sorted by the number of companies owned:

MATCH (o:OWNER)--(p:PERSON)-[r:OWNS]->(c:CORP)
RETURN p.ssn AS Owner, collect(c.name) as Companies, count(r) as Count 
ORDER BY Count DESC


Diagram 1: Owners by Company Ownership

Boom! There you go. Granted, this isn't a very exciting data set, as I did not have many owners owning multiple companies, but there you go.

What does it look like as a graph, however?

MATCH (o:OWNER)--(p:PERSON)-[r:OWNS]->(c:CORP)-[:EMPLOYS]->(p1) 
WHERE p.ssn in [2879,815,239,5879] 
RETURN o,p,c,p1


Diagram 2: Some companies with multiple owners

To me, this is a richer result, because it now shows that owners of more than one company sometimes own shares in companies that have multiple owners. This may yield interesting results when investigating associates who own companies related to you. This was something I didn't see in the tabular result.

Not a weakness of Neo4J: it was a weakness on my part doing the tabular query. I wasn't looking for this result in my query, so the table doesn't show it.

Tellingly, the graph does.

Scenario 2: Contract-relationships of companies 

Let's explore a different path. I wish to know, by company, the contractual-relationships between companies, sorted by companies with the most contractual-relationships on down. How do I do that in Neo4J?

MATCH (c:CORP)-[cc:CONTRACTS]->(c1:CORP) 
RETURN c.name as Contractor, collect(c1.name) as Contractees, count(cc) as Count 
ORDER BY Count DESC


Diagram 3: Contractual-Relationships between companies

This is somewhat more fruitful, it seems. Let's, then, put this up into the graph-view, looking at the top contractor:

MATCH (p:PERSON)--(c:CORP)-[:CONTRACTS*1..2]->(c1:CORP)--(p1:PERSON) 
WHERE c.name in ['YFB'] 
RETURN p,c,c1,p1


Diagram 4: Contractual-Relationships of YFB

Looking at YFB, we can see contractual relationships 'blossom out' from it, as it were, and this is just the immediate tier, then distance 1 out from that! If we go out even just one more step in the contracts, the screen fills with employees, and then, again, you have the forest-for-the-trees problem, where too much data hides the useful results.

Let's prune these trees, then. Do circular relations appear?

MATCH (c:CORP)-[:CONTRACTS*1..5]->(c1:CORP) WHERE c.name in ['YFB'] RETURN c,c1


Diagram 5: Circular Relationship found, but not in YFB! Huh!

Well, would you look at that. This shows the power of the visualization aspect of graph databases. I was examining a hot-spot in corporate trades, YFB, looking for irregularities there. I didn't find any, but as I probed there, a circularity did surface in downstream, unrelated companies: the obvious one being between AZB and MZB, but there's also a circular-relationship that becomes apparent starting with 4ZB, as well. Yes, this particular graph is noisy, but it did materialize an interesting area to explore that may very well have been overlooked with legacy methods of investigation.

Graph Databases.


BAM.

Saturday, June 20, 2015

Business Interrelations as a Graph

We look at a practical application of Graph Theory to take a complex representation of information and distill it to something 'interesting.' And by 'interesting,' I mean: finding a pattern in the sea of information otherwise obscured (sometimes intentionally) by all the overwhelming availability of data.

We'll be working with a graph database created from biz.csv, so to get this started, load that table into neo4j:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///[...]/biz.csv" AS csvLine
MERGE (b1:CORP { name: csvLine.contractor })
MERGE (b2:CORP { name: csvLine.contractee })
MERGE (b1)-[:CONTRACTS]->(b2)

We got to a graph of business interrelations this past week (see, e.g.: http://lpaste.net/5444203916235374592) showing who contracts with whom, and came to a pretty picture like this from the Cypher query:

MATCH (o:OWNER)--(p:PERSON)-[]->(c:CORP) RETURN o,p,c

diagram 1: TMI

That is to say: the sea of information. Yes, we can tell that businesses are conducting commerce, but ... so what? That's all well and good, but let's say that some businesses want to mask how much they are actually making by selling their money to other companies, and then getting it paid back to them. This is not an impossibility, and perhaps it's not all that common, but companies are savvy to watch dogs, too, so it's not (usually) going to be an obvious A -[contracts]-> B -[contracts] -> A relationship.

Not usually; sometimes you have to drill deeper, but if you drill too deeply, you get the above diagram, which tells you nothing, because it shares too much information.

(Which you never want to do ... not on the first date, anyway, right?)

But the above, even though noisy, does show some companies contracting with other companies and then, in turn, being contracted by other companies.

So, let's pick one of them. How about company 'YPB,' for example? (Company names are changed to protect the modeled shenanigans)

MATCH (c:CORP)-[:CONTRACTS]->(c1:CORP) WHERE c.name='YPB' RETURN c, c1

diagram 2: tier 1 of YPB

So, in this first tier we see YPB contracting with four other companies. Very nice. Very ... innocuous. Let's push this inquiry to the next tier: is there something interesting happening here?

MATCH (c:CORP)-[:CONTRACTS*1..2]->(c1:CORP) WHERE c.name='YPB' RETURN c, c1

diagram 3: tier 2 of YPB

Nope. Or what is interesting is to see the network of businesses and their relationships (at this point, not interrelationships) begin to extend the reach. You tell your friends, they tell their friends, and soon you have the MCI business-model.

But we're not looking at MCI. We're looking at YPB, which is NOT MCI, I'd like to say for the record.

Okay. Next tier:

MATCH (c:CORP)-[:CONTRACTS*1..3]->(c1:CORP) WHERE c.name='YPB' RETURN c, c1

diagram 4: tier 3 of YPB

Okay, a little more outward growth. Okay. (trans: 'meh') How about the next tier, that is to say: tier 4?

MATCH (c:CORP)-[:CONTRACTS*1..4]->(c1:CORP) WHERE c.name='YPB' RETURN c, c1

diagram 5: tier 4 of YPB

So, we've gone beyond our observation cell, but still we have no loop-back to YPB. Is there none (that is to say: no circular return to YPB)? Let's push it one more time to tier 5 and see if we have a connection.

MATCH (c:CORP)-[:CONTRACTS*1..5]->(c1:CORP) WHERE c.name='YPB' RETURN c, c1


diagram 6: tier 5 with a (nonobvious) cycle of contracts

Bingo. At tier 5, we have a call-back.

But from whom?

Again, we've run into the forest-trees problem: we see that YPB is the source, and YPB is the destination as well, but what is the chain of companies that closes this loop? We can't see it well in this diagram, as we have so many companies. So let's zoom in on the company that feeds money back to YPB and see if that answers our question.

MATCH (c:CORP)-[:CONTRACTS]->(c1:CORP)-[:CONTRACTS]->(c2:CORP)-[:CONTRACTS]->(c3:CORP)-[:CONTRACTS]->(c4:CORP)-[:CONTRACTS]->(c5:CORP) 
WHERE c.name='YPB' AND c5.name='GUB' 
RETURN c, c1, c2, c3, c4, c5

diagram 7: cycle of contracts from YPB

Aha! There we go. By focusing our query the information leaps right out at us. Behold, we're paying Peter, who pays Paul to pay us back, and it's there, plain as day.

Now, lock them up and throw away the key? No. We've just noted a cyclical flow of contracts, but as to the legality of it, that is: whether it is allowed or whether this is fraudulent activity, there are teams of analysts and lawyers who can sink their mandibles into this juicy case.

No, we haven't determined innocence or guilt, but we have done something, and that is: we've revealed an interesting pattern, a cycle of contracts, and we've also identified the parties to these contracts. Bingo. Win.

The problem analysts face today is diagram 1: they have just too much information, and they spend the vast majority of their time weeding out the irrelevant information to winnow down to something that may be interesting. We were presented with the same problem: TMI. But, by using graphs, we were able to see, firstly, that there are some vertices (or 'companies') that have contracts in and contracts out, and, by investigating further, we were able to see a pattern develop that eventually cycled. My entire inquiry lasted minutes of queries and response. Let's be generous and say it took me a half-hour to expose this cycle.

Data analysts working on these kinds of problems are not so fortunate. Working with analysts, I've observed that:

  1. First of all, they never see the graph: all they see are spreadsheets.
  2. Secondly, it takes days to get to even just those spreadsheets of information.
  3. Third, where do they go from there? How do they see these patterns? The learning curve for them is prohibitive, making training a bear, niching their work to just a few brilliant experts, and shunting out able-bodied analysts who are more than willing to help but just don't see the patterns in grids of numbers.

With the graph-representation, you can run into the same problems, but:

  1. Training is much easier for those who can work with these visual patterns.
  2. Information can be overloaded, leaving one overwhelmed, but then one can very easily reset to just one vertex and expand from there (simplifying the problem-space). Then the problem grows in scope when you decide to expand your view, and if you don't like that expanse, it's very easy either to reset or to contract the view.
  3. An analyst can focus on a particular thread, or materialize a thread on which to focus, or branch from a point, or branch to associated sets. If a thread is not yielding interesting results, they can pursue other, more interesting areas of the graph, all without losing their place.

The visual impact of the underlying graph (theory) cannot be over-emphasized. An analyst who says, "Boss, we have a cycle of contracts!" and presents a spreadsheet requires explanation, checking, and verification. That same analyst comes into the boss's office with diagram 7, and the cycle practically leaps off the page and explains itself. That, coupled with the ease and speed with which these cycles are explored and materialized visually, makes a compelling case for modeling related data as graphs.

We present, for your consideration: graphs.



Models presented above are derived from various governmental sources, including the Census Bureau, the Department of Labor, the Department of Commerce, and the Small Business Administration.

Graphs calculated in Haskell and stored and visualized in Neo4J