05 August 2017

TitanDB: why doesn't my index work?

The setting

One of our customer's products is a Play2 application written in Scala on top of the Lightbend Reactive Platform. The application takes advantage of a graph database – TitanDB 1.0.0 in embedded mode, backed by Cassandra.

TitanDB allows fast and convenient queries across entities bound by relations (e.g. social networks).

The application uses Titan's Java API to manipulate and query the graph.

The problem

At some point, we started noticing that certain queries were running really sluggishly – this was accompanied by Titan log messages such as this one:

Query requires iterating over all vertices [(name = user001 AND ~label = agent)]. For better performance, use indexes

This was strange, as our application did check at start-up that the necessary indexes existed – and created them if they didn't. The code looked something like this:

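A minimal Scala sketch of that setup, assuming an embedded TitanGraph handle (the vertex label names here are assumptions; the index names match those appearing later in the article):

import org.apache.tinkerpop.gremlin.structure.Vertex
import com.thinkaurelius.titan.core.TitanGraph
import com.thinkaurelius.titan.core.schema.ConsistencyModifier

// Create each composite index unless it already exists.
def ensureIndexes(graph: TitanGraph): Unit = {
  val indexes = Seq(
    "CompositeNameSalesperson" -> "salesperson",
    "CompositeNameAgent"       -> "agent"
  )
  val mgmt = graph.openManagement()
  val name = mgmt.getPropertyKey("name")
  for ((indexName, labelName) <- indexes if mgmt.getGraphIndex(indexName) == null) {
    val index = mgmt.buildIndex(indexName, classOf[Vertex])
      .addKey(name)
      .indexOnly(mgmt.getVertexLabel(labelName))
      .unique()
      .buildCompositeIndex()
    mgmt.setConsistency(index, ConsistencyModifier.LOCK)
  }
  mgmt.commit()
}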

It was a surprise: a query that was expected to use an index actually did not. For testing purposes, I manually ran a hand-crafted query that should definitely have used that index and ... got the same result. Interesting.

Scanning through the index documentation did not shed more light (well, there actually were hints at my problem on that page, but I missed them at the time). So what do software developers do when RTFM doesn't help? Right, Google it. I found a number of StackOverflow articles and realized that TitanDB indexes have a lifecycle. The magical 'RTFS' trick (read those sources :)) filled in the details:

    INSTALLED - The index is installed in the system but not yet registered with all instances in the cluster
    REGISTERED - The index is registered with all instances in the cluster but not (yet) enabled
    ENABLED - The index is enabled and in use
    DISABLED - The index is disabled and no longer in use

So it's possible our indexes weren't enabled. The next step was to check the statuses of our indexes – this can be done via the Gremlin console (there is also a programmatic way to do it from Scala – a sketch follows the console output below):

mgmt = g.getGraph().openManagement(); 
names = [ "CompositeNameSalesperson", "CompositeNameAgent" ]; // <-- all indexes are here, actually
res = names.collect { [ it , mgmt.getGraphIndex(it).getIndexStatus(mgmt.getPropertyKey('name')) ] };
mgmt.commit();
res

so we got something like this:

[
    "[CompositeNameSalesperson, ENABLED]",
    "[CompositeNameAgent, INSTALLED]"
]

Well, well: one of the indexes is not enabled (and effectively not used). And it covers a new vertex type, added a few days ago.
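
By the way, here is roughly what the programmatic check from Scala (mentioned above) could look like – a sketch, assuming a handle to the embedded TitanGraph:

import com.thinkaurelius.titan.core.TitanGraph

// Report each graph index's status for the 'name' property key.
def indexStatuses(graph: TitanGraph, indexNames: Seq[String]): Seq[(String, String)] = {
  val mgmt = graph.openManagement()
  try {
    val name = mgmt.getPropertyKey("name")
    indexNames.map(n => n -> mgmt.getGraphIndex(n).getIndexStatus(name).toString)
  } finally mgmt.commit()
}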

The fix

Let's postpone figuring out why the index got stuck in the INSTALLED state for a while. First order of business: I needed to repair the index by promoting it into the ENABLED state.

Since my index is INSTALLED, the next state should be REGISTERED. Let's go:

mgmt = g.getGraph().openManagement(); 
mgmt.updateIndex(mgmt.getGraphIndex("CompositeNameAgent"), com.thinkaurelius.titan.core.schema.SchemaAction.REGISTER_INDEX); 
mgmt.commit();


Now I checked the state of the index again:

mgmt = g.getGraph().openManagement(); 
names = [ "CompositeNameAgent" ]; 
res = names.collect { [ it , mgmt.getGraphIndex(it).getIndexStatus(mgmt.getPropertyKey('name')) ] };
mgmt.commit();
res

and got:

[
    "[CompositeNameAgent, INSTALLED]"
]


What?! Why? I turned verbose logging on and got:

Some key(s) on index CompositeNameAgent do not currently have status REGISTERED: name=INSTALLED

The next iteration of Seq(documentation, Google, StackOverflow).foreach(_.read) yielded the following possible reasons why changing the state of an index might fail:
- open transactions
- stalled Titan instances

I was pretty sure nobody was using Titan at that moment, so I took the second path to investigate. Fortunately, the Titan API can provide enough information:

mgmt = g.getGraph().openManagement(); 
y = mgmt.getOpenInstances();
mgmt.commit();
y;

and the response is:

[ac1100041-hostname(current), ac1100041-hostname, ac1100021-hostname]

Hmm, so why are there 3 instances if I knew that only one application instance was running? Let's consult the documentation: "... TitanFactory can also be used to open an embedded Titan graph instance from within a JVM-based user application ..." – this would explain the existence of 2 instances (we're using the factory twice), but what about the 3rd one?

Reading further in the documentation, I found a failure scenario matching mine:

"However, some schema related operations - such as installing indexes - require the coordination of all Titan instances. For this reason, Titan maintains a record of all running instances. If an instance fails, i.e. is not properly shut down, Titan considers it to be active and expects its participation in cluster-wide operations which subsequently fail because this instance did not participate in or did not acknowledge the operation. In this case, the user must manually remove the failed instance record from the cluster and then retry the operation."

Bingo! That explains our inability to change the status of the index. The "ac1100021-hostname" instance seems to be the black sheep here. Let's get rid of it:

toRemove = ["ac1100021-hostname"];
mgmt = g.getGraph().openManagement();
x = mgmt.getOpenInstances();
toRemove.collect { mgmt.forceCloseInstance(it) };
mgmt.commit();
// reopen: the previous management transaction was closed by commit()
mgmt = g.getGraph().openManagement();
y = mgmt.getOpenInstances();
mgmt.commit();
[x, y];


that operation returned:

[
    "[ac1100041-hostname(current), ac1100041-hostname, ac1100021-hostname]",
    "[ac1100041-hostname(current), ac1100041-hostname]"
]


Now I repeated the operations from the beginning of this section and got:

[
    "[CompositeNameAgent, REGISTERED]"
]

I did it! Finally, let's enable the index...

mgmt = g.getGraph().openManagement(); 
mgmt.updateIndex(mgmt.getGraphIndex("CompositeNameAgent"), com.thinkaurelius.titan.core.schema.SchemaAction.ENABLE_INDEX); 
mgmt.commit();

...and verify the result:

mgmt = g.getGraph().openManagement(); 
names = [ "CompositeNameAgent" ]; 
res = names.collect { [ it , mgmt.getGraphIndex(it).getIndexStatus(mgmt.getPropertyKey('name')) ] };
mgmt.commit();
res


we finally get:

[CompositeNameAgent, ENABLED]

I checked the original query (which was supposed to use the index) – and it ran as fast as the other queries. Problem solved! But wait a minute, the game is not over.

Root causes

So why did the index get stuck in the INSTALLED state in the first place? (I also recall seeing some indexes stuck in the REGISTERED state.)

I ended up with 2 possible reasons:

a. If an application instance (with an embedded Titan instance) crashes, it remains in Titan's bookkeeping as a cluster member – this dead instance prevents the index state from progressing, as it cannot be communicated with. The following log records can be observed in this case:

Some key(s) on index CompositeNameAgent do not currently have status ENABLED: name=REGISTERED
...
Timed out (PT1M) while waiting for index CompositeNameAgent to converge on status ENABLED

Advice: make sure there are no dangling Titan cluster members – clear them out prior to creating indexes (or doing any other cluster-wide operation).
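
A sketch of such a cleanup at application start-up, assuming (as in the getOpenInstances output above) that the current instance is suffixed with "(current)":

import scala.collection.JavaConverters._
import com.thinkaurelius.titan.core.TitanGraph

// Force-close every registered instance except our own. Only safe when we
// know no other live Titan instance shares the cluster.
def evictStaleInstances(graph: TitanGraph): Unit = {
  val mgmt = graph.openManagement()
  try {
    mgmt.getOpenInstances.asScala
      .filterNot(_.endsWith("(current)"))
      .foreach(instance => mgmt.forceCloseInstance(instance))
  } finally mgmt.commit()
}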

b. An issue with the code that creates indexes:

          val index = management.buildIndex(indexName, classOf[Vertex]).addKey(key).indexOnly(label).unique().buildCompositeIndex()
          management.setConsistency(index, ConsistencyModifier.LOCK)

Reading through the Titan documentation and StackOverflow articles, I found that this call returns as soon as the index has entered the INSTALLED state. To be sure that the index has actually reached the ENABLED state (and hence is ready to be used by queries), one must wait for index creation to complete outside the transaction in which the index was created, and make sure no other transactions interfere. So the index setup code (see the beginning of the article) was changed to:
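
Roughly, as a sketch (the label parameter and the ten-minute timeout are illustrative; ManagementSystem.awaitGraphIndexStatus is Titan's documented way to wait for index convergence):

import java.time.temporal.ChronoUnit
import org.apache.tinkerpop.gremlin.structure.Vertex
import com.thinkaurelius.titan.core.TitanGraph
import com.thinkaurelius.titan.core.schema.{ConsistencyModifier, SchemaStatus}
import com.thinkaurelius.titan.graphdb.database.management.ManagementSystem

def createEnabledIndex(graph: TitanGraph, indexName: String, labelName: String): Unit = {
  // 1. Create the index in its own management transaction.
  val mgmt  = graph.openManagement()
  val index = mgmt.buildIndex(indexName, classOf[Vertex])
    .addKey(mgmt.getPropertyKey("name"))
    .indexOnly(mgmt.getVertexLabel(labelName))
    .unique()
    .buildCompositeIndex()
  mgmt.setConsistency(index, ConsistencyModifier.LOCK)
  mgmt.commit()

  // 2. Outside that transaction, block until all instances report ENABLED
  //    (call() returns a status report once the index converges or the timeout passes).
  ManagementSystem.awaitGraphIndexStatus(graph, indexName)
    .status(SchemaStatus.ENABLED)
    .timeout(10, ChronoUnit.MINUTES)
    .call()
}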



31 July 2017

Indoor shooting range: automated billing and admissions

In this blog entry we will share a business case in which Digital Magic helped a business evolve a fledgling idea into a mature product.


The setting

A while back, an entrepreneur decided to establish an indoor shooting range business. His plan was to refurbish an abandoned factory building and hire an administrator who would deal with customer admissions.


The problem

When selling bullets or cartridges, it is easy to ensure that every item has been accounted for: compare the physical ammo inventory to the amount of cash the company account has received since the last stocktake. This is not so with "shooting time", which is far less tangible.

Simply put, the business owner wanted to ensure that shooting time is correctly inventoried.


The concept

The business owner came up with an idea – separating the two conflicting concerns out:
  • admissions sales would be handled by the administrator
  • physical access to a shooting lane and customer billing would be handled completely by an automated system
The business owner then approached us – the Digital Magic team – with this idea and asked us to help make it real.


The solution

We sat together with the business owner, analysing the idea, crystallising it into a vision of an actual product.

One of the fundamental questions was how to identify the customer account with which to associate credit. Having discarded technologies such as loyalty cards, RFID stamps, optical face recognition and mobile phone-based IDs, we converged on customer identification based on smart cards. To be more precise, Estonian ID cards (as the business resides in Estonia, where every resident possesses such a card).

The solution was then straightforward:
  • each shooting lane would be equipped with:
    • its own lighting
    • a rail running along the ceiling allowing the target to be brought closer to or further away from the firing point
    • an electric motor controlled by a joystick that would actually move the target
    • a smart card reader
  • initially, all shooting lanes would have their lights off, and their targets brought close to the firing point, which renders shooting impossible (or at least no fun :))
  • a customer would enter the shooting range and buy a certain time allowance from the administrator
  • using a management console, the administrator would associate the credit with the customer’s account, based on his smart card ID number
  • the client would then approach a shooting lane and insert his ID smart card into that lane’s card reader
  • the system would turn the lane’s lights on, and power the electric motor joystick on so that the customer could move the target as far away from the firing point as he wishes
  • the system would track the time during which the smart card resides within the card reader
  • as soon as the credit runs out (or the card gets removed), the system would turn the lane's lights off and retract the target close to the firing point, powering the joystick down and effectively denying the customer the ability to shoot.

The design

The business owner took care of installing the high-current electrics and devices: each shooting lane was now equipped with lighting, an electrical motor shifting the shooting target along the rail, and a motor control joystick – all connected to a power source. Each lane could now be individually powered up or down by simply closing or opening the corresponding pair of wires.

Actual activation and deactivation of each lane was now up to an intelligent system, which - both hardware and software - was to be designed, implemented and installed by Digital Magic.

The diagram below depicts the overall design:

Hardware

The first order of business was to choose a controller that would become the bridge between the software customer account management system and the high-current hardware lane controls.

We have chosen the Laurent-112 Ethernet network relay, which has the ability to:
  • control all 7 shooting lanes (there are 12 relays on the PCB)
  • connect to Ethernet and expose an HTTP server with a simple API
The next piece of hardware was the smart card readers – we just used generic USB readers. One challenge was that the firing points were some distance away from the server room, so each reader's USB cable was extended with a 20+ meter extension cable featuring a signal repeater.

The server running the actual management application is a simple office PC running Ubuntu, too boring for its specs to be listed :) One requirement for the PC, though, was having at least 7 USB ports – one for each shooting lane. We also tried connecting all the readers via a USB hub, with success – the server could still discern individual card readers – but eventually settled on the simpler hub-less solution.


Software

The application server is the brain behind the operation.

It is an application based on Play and Akka Persistence – you can read more about the latter technology in this blog post; however, the choice of frameworks did not matter much for this business case, as the project was small-scale.

The application server is hooked up as follows (the resulting lane bindings are sketched after the list):
  • 7 smart card readers (installed at each shooting lane's firing point) are connected directly to the PC's 7 USB ports.
    • Ubuntu gives each USB port a name; server configuration binds each port name to its shooting lane.
  • the PC is connected to the Laurent-112 controller via Ethernet.
    • The controller can be sent simple commands via HTTP using a proprietary protocol to:
      • close a relay, activating a particular shooting lane
      • open a relay, turning that lane's lights off
      • check relay status
    • Server configuration binds each relay to its shooting lane.
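
A plausible shape for those bindings – a sketch, with all names illustrative:

// Hypothetical lane binding: the javax.smartcardio terminal name and the
// Laurent-112 relay number serving each shooting lane.
case class LaneBinding(terminalName: String, relay: Int)

val lanes: Map[Int, LaneBinding] = Map(
  1 -> LaneBinding("OMNIKEY CardMan 1021 01 00", 1),
  2 -> LaneBinding("OMNIKEY CardMan 1021 02 00", 2)
  // ... lanes 3 to 7 follow the same pattern
)
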
The application server's main responsibility is listening to javax.smartcardio smart card events (a sketch of this loop follows the list):
  • whenever an ID card is inserted into the card reader of shooting lane N, the customer account associated with the card is checked for credit availability
    • if there is enough credit, a "close relay N" command is sent to the Laurent-112 controller, activating the corresponding lane
    • the system starts tracking passage of time with 1-second resolution for that lane.
    • as soon as the customer's credit gets depleted, the application disengages the shooting lane.
  • whenever an ID card is removed from the card reader of shooting lane N, an "open relay N" command is sent to the Laurent-112 controller, deactivating the corresponding lane.
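
A condensed sketch of that loop (hasCredit, readCardId and the lane helpers are hypothetical placeholders; the Laurent-112 command syntax is proprietary and not reproduced here):

import javax.smartcardio.{CardTerminal, TerminalFactory}
import scala.collection.JavaConverters._

// Hypothetical helpers: credit lookup, card ID retrieval and relay commands.
def hasCredit(cardId: String): Boolean = ???
def readCardId(terminal: CardTerminal): String = ??? // APDU exchange with the ID card
def activateLane(lane: Int): Unit = ???   // HTTP call closing relay N on the Laurent-112
def deactivateLane(lane: Int): Unit = ??? // HTTP call opening relay N

// One blocking watcher per lane; the lane-to-terminal binding comes from configuration.
def watchLane(lane: Int, terminal: CardTerminal): Unit =
  while (true) {
    terminal.waitForCardPresent(0)            // block until a card is inserted
    val cardId = readCardId(terminal)
    if (hasCredit(cardId)) activateLane(lane) // engage lights, motor and joystick
    // The real system also ticks the credit down every second while the card
    // is present, disengaging the lane when the credit hits zero (omitted).
    terminal.waitForCardAbsent(0)             // block until the card is removed
    deactivateLane(lane)
  }

// Card readers as enumerated by javax.smartcardio,
// e.g. "OMNIKEY CardMan 1021 01 00".
val terminals = TerminalFactory.getDefault.terminals().list().asScala
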
The application server also exposes the following functionality via WEB-based management consoles:
  • a WEB UI for the shooting range administrator
    • associate credit with a customer's account
  • a WEB UI for the business owner
    • create and delete administrator accounts
    • create tariff plans
    • request billing reports:
      • hourly customer activity reports
      • summary of customer payments during a time period

The bottom line

Using billing reports generated by the system, the business owner was now able to easily inventory shooting time by comparing bank account cash flows to the amount of credit sold as registered by the system.

Problem solved.


Application server frameworks

  • Play 2
  • Akka Persistence + LevelDB backend (very light loads)
  • javax.smartcardio to interact with smart cards
  • written in Scala

Issues we've encountered

On rare occasions, e.g. upon an Ubuntu server reboot, the USB port names would change (so that, e.g., the first shooting lane's card reader, previously named OMNIKEY CardMan 1021 01 00, would now be named OMNIKEY CardMan 1021 06 00). These occasions were so rare that the business owner had no trouble fixing them by re-mapping ports via the WEB management console.

In order to correctly handle application crashes or application server hardware failures, the Laurent-112 PCB needs to be configured to periodically ping the application to sense outages – in such a case, the PCB would open all relays, deactivating all shooting lanes, since by that moment the application has lost the ability to track credit. At this particular installation we have not done so, but it is perfectly possible, e.g. by running the application as a Docker container with a heartbeat port exposed: any kind of crash would render the port inaccessible.
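
A minimal sketch of such a heartbeat port (the port number and the wiring into the application are assumptions; the PCB-side ping configuration is device-specific):

import java.net.ServerSocket

// While the JVM is alive, connections to this port succeed; after any crash
// the port goes dark, and a watchdog-configured relay board can open all relays.
def startHeartbeat(port: Int): Thread = {
  val t = new Thread(new Runnable {
    override def run(): Unit = {
      val server = new ServerSocket(port)
      while (true) server.accept().close() // accept and immediately drop
    }
  })
  t.setDaemon(true)
  t.start()
  t
}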


Ideas for the future

Additional reports, such as a heat map report, would help answer questions such as:
  • which are peak customer activity hours?
  • which are the best business hours?
  • should additional shooting lanes be built?
Also, the list of services offered by the automated system could be extended beyond shooting time to guns, ammo and targets, each with prices automatically calculated from tariffs, the client's track record, campaigns, etc.