We live in a thoroughly networked and digitised world. There is virtually no area of life left that has not been connected to the network. Be it trade and industry, the world of work or private life, we have all gone online. Every action we perform leaves data behind, and in bulk – connection data, device data, personal data, text files, image files, audio files – a huge volume of data. What is more, in addition to the sheer volume, the number of links between the data and the number of different data types are also increasing.
Relational databases (RDBMS) are increasingly reaching their limits, since they store data separately by type – in tables. They are neither structurally nor conceptually designed for highly linked data. Consequently, as the number of connections increases, so does the number of tables that need to be considered in a query. The result is a schema that is hard to understand, difficult to maintain and slow in the face of increasingly dynamic data.
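The join problem can be seen even in a toy example. The sketch below (invented data, illustrative names) answers the same two-hop question – "friends of friends" – first in a relational style, where each extra hop means another self-join over the link table, and then in a graph style, where each hop is just one step along an adjacency list:

```python
# Hypothetical comparison: two-hop "friends of friends" in a
# relational style vs. a graph style. All names are invented.

friendships = [("ann", "bob"), ("bob", "col"), ("ann", "dee"), ("dee", "eve")]

# Relational style: every additional hop is another self-join
# over the same table.
def fof_relational(person):
    hop1 = {b for a, b in friendships if a == person}          # first "join"
    return {c for b, c in friendships if b in hop1}            # second "join"

# Graph style: build adjacency once, then follow edges directly.
graph = {}
for a, b in friendships:
    graph.setdefault(a, set()).add(b)

def fof_graph(person):
    return {c for b in graph.get(person, ()) for c in graph.get(b, ())}

print(fof_relational("ann"))  # {'col', 'eve'}
print(fof_graph("ann"))       # {'col', 'eve'}
```

The results are identical, but the relational version has to re-scan the whole table per hop, while the graph version only touches the neighbours of the nodes it visits – the essence of why traversals scale better than join chains.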
Another problem is the limited data model – most RDBMS do not provide for the derivation of data types. When new requirements arrive, new tables often have to be added and linked to existing ones. This can be handled either by creating a new table for each data type (normalisation) or by cramming the existing table with fields from the derived types until it reaches its limits. Both approaches relieve the database system, but they are disastrous for data analysts, who can no longer see which fields belong to which derivation, how data can be aggregated across tables, or why tables in one specific constellation are connected – i.e. the reasoning behind linking one table to another.
Not only SQL
For this reason, the NoSQL movement (“Not only SQL”) was a necessary step towards breaking out of the table schema of classical databases. Various approaches have been, and are being, tried out for this – from the key-value store* to the improvement of RDBMS systems** to graph databases, which are the subject of this text.
*trading off the parameters consistency (C), availability/load distribution (A) and partition tolerance (P)
**through atomisation, outsourcing and the like
For 15 years now, graph databases have been responsible for the digital success of many companies – not only because they make data relationships visible, but also because they enable fast action. In contrast to relational databases, they persist data as graphs and can thus map and evaluate it as a network of relationships. This not only means that its complexity is preserved, but also that database queries can run much faster – without join operations, without MapReduce algorithms. The most obvious example of the benefits of such functionality is Google, whose page ranking is based on graph technology.
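The idea behind Google's ranking can be sketched in a few lines: PageRank treats the web as a graph and repeatedly redistributes each page's score along its outgoing links. Below is a minimal power-iteration version over a tiny invented link graph; the damping factor 0.85 is the commonly cited default, and the iteration count is arbitrary:

```python
# Minimal power-iteration PageRank over a tiny invented link graph.
# Sketch only: the damping factor is the textbook default, the
# iteration count is not tuned.

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
nodes = list(links)
rank = {n: 1.0 / len(nodes) for n in nodes}  # uniform start
d = 0.85  # damping factor

for _ in range(50):
    new = {}
    for n in nodes:
        # score flowing into n: each linker m splits its rank
        # evenly across its outgoing links
        incoming = sum(rank[m] / len(links[m]) for m in nodes if n in links[m])
        new[n] = (1 - d) / len(nodes) + d * incoming
    rank = new

# "c" is linked from both "a" and "b", so it ends up ranked highest.
print(max(rank, key=rank.get))  # c
```

Each iteration only follows edges – exactly the kind of neighbourhood traversal that graph storage makes cheap.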
Areas of application
Identity & Access Management (IAM) – the complexity of IT landscapes increases significantly with the size, branching and diversification of companies. The volatile nature of the markets and digitisation ensure that the dynamics within companies grow as well – pressure to innovate, new business areas, restructuring, flexibility and mobile work. Systems need to be turned on and off more frequently, roles and rights need to be adapted, and new stakeholders or facilities, as well as external locations such as home offices, need to be permitted. This is a scenario in which traditional directory services often cannot scale with the required efficiency.
Enter graph databases. They help to keep complex and closely interwoven structures – with all their individuals, roles and resources – manageable as a clear network; changes to the organisation can be made centrally in one place and then rolled out automatically across the affected networks, and any number of new directories can be connected without losses in performance. This makes them excellent partners in the face of increasingly agile requirements.
- Metadata model – bird’s eye view of all connections and dependencies
- Roles and rights – faster adaptation to volatile work structures
- One for all – employees, suppliers, partners, customers and service providers – all under one roof
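In graph terms, an IAM permission check becomes a reachability question: can we walk from a user node to a resource node along membership and grant edges? The sketch below uses an invented directory – all user, role and resource names are hypothetical – and a plain breadth-first search in place of a real graph query:

```python
# Hypothetical IAM graph: users, roles and resources as nodes,
# "member of" / "grants" relationships as directed edges.
# All names and the directory layout are invented for illustration.

from collections import deque

edges = {
    "alice":       ["dev-team"],           # alice is a member of dev-team
    "dev-team":    ["engineering"],        # role hierarchy
    "engineering": ["git-repo", "wiki"],   # grants on resources
    "bob":         ["sales"],
    "sales":       ["crm"],
}

def has_access(user, resource):
    """Access = the resource node is reachable from the user node."""
    queue, seen = deque([user]), {user}
    while queue:
        node = queue.popleft()
        if node == resource:
            return True
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(has_access("alice", "wiki"))  # True  (alice -> dev-team -> engineering)
print(has_access("bob", "wiki"))    # False
```

Reorganising a team then means moving one edge, not rewriting rights on every affected account – the "change centrally, roll out automatically" property described above.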
Fraud detection – to reduce the extent of fraud and cybercrime as a whole, it is crucial to detect compromised data as quickly as possible. Previously, the search for manipulation attempts focused on individual data sets such as users, accounts, devices and IP addresses. As the attacks are becoming smarter, the detective work has to follow suit – in finance or e-commerce, for example, we are increasingly dealing with third-party fraud; that is, with an attacker or a whole attacker group operating with stolen, artificial or synthetic identities or a remotely controlled computer. With graph databases, such machinations are much easier to detect – with almost no delay after the actual event. This is extremely helpful in the fight against money laundering, tax evasion and credit card fraud. The world’s best-known example of such use is the “Panama Papers” case, which was uncovered in 2016 using graph technology.
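A common graph-based pattern here is ring detection: accounts that share identifiers (phone numbers, devices, addresses) are linked, and an unusually tightly connected cluster of accounts is a candidate fraud ring. The sketch below uses invented accounts and identifiers and a simple connected-component search:

```python
# Fraud-ring sketch: link accounts that share any identifier, then
# find connected components. Accounts, identifier values and the
# notion of "suspicious" are all invented for illustration.

accounts = {
    "acc1": {"phone": "555-01", "device": "dev-A"},
    "acc2": {"phone": "555-01", "device": "dev-B"},  # shares phone with acc1
    "acc3": {"phone": "555-02", "device": "dev-B"},  # shares device with acc2
    "acc4": {"phone": "555-09", "device": "dev-Z"},  # shares nothing
}

# Build the graph: an edge between accounts sharing any identifier.
adj = {a: set() for a in accounts}
for a in accounts:
    for b in accounts:
        if a != b and set(accounts[a].values()) & set(accounts[b].values()):
            adj[a].add(b)

def component(start):
    """All accounts transitively linked to `start` (depth-first)."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adj[n] - seen)
    return seen

print(sorted(component("acc1")))  # ['acc1', 'acc2', 'acc3'] - interlinked
print(component("acc4"))          # {'acc4'} - an isolated, normal account
```

Note that acc1 and acc3 share no identifier directly; only the transitive link through acc2 exposes them as one cluster – exactly the kind of relationship a record-by-record check misses.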
In addition, graph databases help IT security with prevention and defence. Which attack scenarios exist? How are they evolving? Which security vulnerabilities do they exploit? And what countermeasures should be taken?
Route planning – the structure of graph databases makes them the ideal tool for any kind of network planning. In logistics, this is particularly evident in the requirements of intelligent network planning. To find the best combination, many different parameters need to be considered in transport logistics – freight, distance, means of transport, traffic routes, traffic times, loading times, loading and unloading points, nodes, fleet availability and the like. For graph databases, it is easy to optimise and simulate all variables and constants of route planning, including causal relationships and rapidly changing dependencies. Of course, this data can also be used later for dispatching.
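At the core of such planning sits a weighted-shortest-path query, which graph engines run natively. The sketch below shows the classic Dijkstra algorithm over a tiny route network – the city names and distances are made up for illustration:

```python
# Dijkstra shortest path over an invented route network.
# Cities and distances are illustrative, not real figures.

import heapq

routes = {
    "hamburg":   {"hannover": 150, "berlin": 290},
    "hannover":  {"frankfurt": 350, "berlin": 290},
    "berlin":    {"leipzig": 190},
    "leipzig":   {"frankfurt": 390},
    "frankfurt": {},
}

def shortest(start, goal):
    """Return (total_distance, path) with minimal total distance."""
    heap = [(0, start, [start])]  # priority queue ordered by cost
    seen = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, dist in routes[node].items():
            if nxt not in seen:
                heapq.heappush(heap, (cost + dist, nxt, path + [nxt]))
    return None  # goal unreachable

print(shortest("hamburg", "frankfurt"))
# (500, ['hamburg', 'hannover', 'frankfurt'])
```

Real transport planning layers further edge properties (loading times, fleet availability, time windows) onto the same model, which is why the graph representation carries over so directly.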
Networks & IT operations – IT landscapes are highly heterogeneous – hardware, software, servers, interfaces, routers, browsers, virtual and analog components – a multi-layered terrain. It is easy to imagine how much work system administrators put into maintaining and updating these landscapes. It is equally easy to imagine how useful graphs can be for them: they map these landscapes as a complex data model and thus make it easier for administrators to keep track of all dependencies, for example in cause and error analysis – especially since these landscapes are subject to constant change.
Sales & marketing – graph databases are essential today, especially in online retail, where they are used for targeted, needs-based recommendation management – for example, a recommendation engine: “Customers who bought this product are also interested in …”. In the past, buying tips had to be prepared and initiated overnight via batch jobs. Today, in addition to the individualisation of offers, real-time response is a decisive criterion for market success.
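Such a recommendation is just a two-hop walk – product to buyers, buyers to other products – which a graph engine can answer at request time instead of overnight. The toy recommender below uses invented customers and orders:

```python
# Toy co-purchase recommender: "customers who bought this also
# bought ...". Customers, orders and products are invented.

orders = {
    "anna": {"book", "lamp"},
    "ben":  {"book", "mug"},
    "cara": {"book", "mug"},
    "dan":  {"pen"},
}

def recommend(product):
    # hop 1: everyone who bought the product
    buyers = [c for c, items in orders.items() if product in items]
    # hop 2: what else those customers bought, ranked by frequency
    counts = {}
    for c in buyers:
        for item in orders[c] - {product}:
            counts[item] = counts.get(item, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)

print(recommend("book"))  # ['mug', 'lamp'] - mug bought by 2 of 3 buyers
```

Because only the neighbourhood of one product node is touched, the query stays fast regardless of catalogue size – the property that makes real-time recommendations feasible.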
Three graph databases in comparison
- Neo4j – this is the best-known graph database. Its main benefit is its powerful Cypher Query Language, which can query even the most complex data relationships with good performance.
- OrientDB – this is not a pure graph database, but rather a multi-model database with graph functions, which is particularly suitable for archiving and retrieving documents, since it can also be used as a pure document or key-value store. Both Neo4j and OrientDB can be embedded, stand-alone or distributed.
- JanusGraph – its design focuses strongly on distribution and large amounts of data, and it supports various storage back-ends. It is thus a graph layer attached to existing databases and technologies. Even a small example setup comes with the Cassandra database as storage and Elasticsearch for indexing, making JanusGraph definitely too extensive for stand-alone or client-side use.