Graph databases are one of those things that sound neat but you'll be hard pressed to find people using them that don't regret it.
I memorably had a job interview which consisted almost entirely of their senior architect going over exactly why he regretted introducing Neo4J several years earlier and how all the work is really about getting away from it. That was just the most extreme example.
The truth that people here don't like is that the Couch/Mongo style document DB is far more compelling as an intermediate point between structured and unstructured. There was even a MongoDB compatibility layer for FoundationDB, but sadly it doesn't seem to be maintained. https://github.com/FoundationDB/fdb-document-layer
>The truth that people here don't like is that the Couch/Mongo style document DB is far more compelling as an intermediate point between structured and unstructured.
In my opinion, graph DBs should only be used for highly structured data, which after all is what a graph is: generally, anything you would represent in SQL where the queries you commonly have to run would require too many joins.
What's so much more compelling about a graph DB than a "WITH RECURSIVE" SQL query? And even such recursive queries should be fairly rare, since you would rather cache frequently accessed foreign-key relationships anyway.
Among other things, some graph DBs are designed to perform the equivalent of "with recursive", filtering by attributes as you chase links and/or calculating statistics over subgraphs, more efficiently than SQL databases.
If you're just pulling up a tree of comments on an article using parent-child relations, SQL will be fine, though for query latency you might be better off with a "flat list" article-comment relation instead and recovering the tree structure after fetching the comments.
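To make the comparison concrete, here is a minimal sketch of that comment-tree fetch, assuming a hypothetical comments(id, parent_id, article_id, body) table in Postgres:

WITH RECURSIVE thread AS (
    SELECT id, parent_id, body, 1 AS depth
    FROM comments
    WHERE article_id = $1 AND parent_id IS NULL   -- top-level comments
  UNION ALL
    SELECT c.id, c.parent_id, c.body, t.depth + 1
    FROM comments c
    JOIN thread t ON c.parent_id = t.id           -- follow the replies down
)
SELECT * FROM thread;

-- The "flat list" alternative: one indexed lookup, then rebuild the tree in code.
-- SELECT id, parent_id, body FROM comments WHERE article_id = $1;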
Under the hood they're basically the same (modulo graph databases are generally less mature and less featureful in their storage implementation). The main difference is impedance match with the human writing the query. Graph databases offer query language semantics that are easier to write and reason about, for graphy workloads.
> why he regretted introducing Neo4J
Even for the use cases graph DBs knock out of the park, Neo4j (historically - I haven't used it in like 10 years) didn't work very reliably compared to modern competitors.
But as always, it's about picking the right tool for the job - I tried to build a "social network" in both MySQL and Neo4j, and (reliability aside) Neo4j worked way better.
What were some of the pain points mentioned?
Having worked with SQL for many years now, I truly disliked GraphQL when I used it. Most of the time all I need is just a CSV-like structure of the data I'm working with, and I had a very difficult time getting even that. Joining tables tended to be difficult where the pivot points between relationships were rather semantically unbounded, so it was hard for me to build a reasonably rigid mental model of the data I was working with. That was especially noticeable in some of the API documentation I worked with -- where a simple REST-like API just gives a simple, usually flat, JSON, GraphQL API responses were deeply nested, often in implementation-dependent ways[1], and thousands of lines long, which has IME made explorability practically impossible.
GraphQL implementations have consistently felt like a hobby project where implementing it in an API becomes a thing to put on a resume rather than making a functionally useful API.
[1] https://graphql-docs-v2.opencollective.com/queries/account (navigate to the variables tab for a real mindfuck)
And GraphQL is related to Graph databases how exactly? Just because they both have the word graph in them?
Heh, you're right I got it wrong but there's really no need for your weird aggression. Cheers.
It's unfortunate GraphQL co-opted the terminology, because it's quite different from the kinds of graphs these databases attempt to model.
IME the dreaded out of memory error. This is all they have to say on the matter: https://neo4j.com/developer/kb/recommendations-for-recovery-...
Essentially, that using the graph DB prevented any imposition of order or discipline on the structure of the data, due to the constant need to import new customer data in subtly different structures to keep the business running. That led to a complete inability to deliver anything new at all, since no one could make assertions about what was in there. And since they couldn't change it without risking breaking a customer, they were migrating one customer at a time to a classic RDBMS. (There were like 200 customers, each of which is a company you've probably heard of.)
Many will go "you need a proper ontology" at which point just use a RDBMS. Ontologies are an absolute tarpit, as the semantic web showed. The graph illusion is similar to that academic delusion "It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures." which is one of those quips that makes you appreciate just how theoretical some theoreticians really are.
Neo4j has a GraphRAG book that I've found very helpful: https://neo4j.com/essential-graphrag/
It depends on the shape of your data. In my domain (cloud security), there are many many entities and it's very valuable to map out how they relate to each other.
For example, we often want to answer a question like: “Which publicly exposed EC2 instances are used by IAM roles that have administrative privileges in my AWS account?”
To answer the question, you need to:
1. Join EC2 instances to security groups to IP rules to IP ranges to find network exposure paths to the open internet.
2. Join the instances to their instance profiles, then to their roles.
3. Join the IAM roles to their role policies to determine which have admin policies.
4. Chain all of those joins together, possibly with recursive queries if there are indirect relationships (e.g., role assumption chains).
That’s a lot of joins, and the SQL query would get both heavy and hard to maintain.
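For a sense of scale, here's a hedged sketch of roughly what that relational query could look like - every table and column name below is an assumption for illustration, not a real schema:

SELECT i.id AS instance_id, r.name AS role_name
FROM ec2_instances i
JOIN instance_security_groups isg ON isg.instance_id = i.id
JOIN security_groups sg           ON sg.id = isg.group_id
JOIN ip_permissions perm          ON perm.group_id = sg.id AND perm.action = 'Allow'
JOIN ip_ranges rng                ON rng.permission_id = perm.id AND rng.cidr = '0.0.0.0/0'
JOIN instance_profiles ip         ON ip.instance_id = i.id
JOIN roles r                      ON r.id = ip.role_id
JOIN role_policies rp             ON rp.role_id = r.id
JOIN policies p                   ON p.id = rp.policy_id
JOIN policy_statements stmt       ON stmt.policy_id = p.id
WHERE stmt.effect = 'Allow' AND stmt.resource = '*';
-- ...and this still ignores the recursive role-assumption chains from step 4.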
In a graph DB, this query looks something like:
match (i:EC2Instance)--(sg:EC2SecurityGroup)--(perm:IPPermissionInbound{action:"Allow"})--(rng:IPRange{id:"0.0.0.0/0"})
match (i)--(r:AWSRole)--(p:AWSPolicy)--(stmt:AWSPolicyStatement{effect:"Allow", resource:"*"})
return i.id as instance_id, r.name as role_name
To answer this question - which internet-open compute instances can act as admins in our environment - we needed to traverse multiple objects, but the shape of the answer is pretty simple: just a list of IDs and names.
Graph databases have quirks and add complexity of their own. If your domain isn't this edge heavy, you're probably better off with Postgres, but for our use-case it's been worth the trade-off imo.
I'm building https://www.ergodic.ai - and we are using graphs as the primary objects in which the intelligence operates.
I don't think every graph needs a graph database. For 99% of use-cases a relational database is the preferred solution to store a graph: provided that we have objects and ways to link objects, we're good to go. The advantage of graph DBs is in running more complex graph algorithms whenever that is required (traversal, etc.), which is more efficient than "hacking it" with recursive queries in a relational DB.
For us, I've yet to find the need for a dedicated graph db with few exceptions, and in those exceptions https://kuzudb.com/ was the perfect solution.
> which is more efficient than "hacking it" with recursive queries in a relational db
It seems to me that the way recursive CTEs were originally defined is the biggest reason that relational databases haven't been more successful with users who need to run serious graph workloads - in Frank McSherry's words:
> As it turns out, WITH RECURSIVE has a bevy of limitations and mysterious semantics (four pages of limitations in the version of the standard I have, and I still haven't found the semantics yet). I certainly cannot enumerate, or even understand the full list [...] There are so many things I don't understand here.
https://github.com/frankmcsherry/blog/blob/master/posts/2022...
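To give a concrete taste of those limitations (my own illustration, not McSherry's; behavior as in Postgres, with a hypothetical edges(src, dst) table): the recursive term may reference the CTE only once, so the natural "non-linear" way to write transitive closure is rejected and you have to restate it linearly.

WITH RECURSIVE reach(src, dst) AS (
    SELECT src, dst FROM edges
  UNION
    -- Rejected by Postgres: the recursive reference "reach" would appear twice:
    --   SELECT a.src, b.dst FROM reach a JOIN reach b ON a.dst = b.src
    -- Accepted: the linear restatement, extending paths one edge at a time.
    SELECT r.src, e.dst
    FROM reach r
    JOIN edges e ON e.src = r.dst
)
SELECT * FROM reach;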
would you consider needing to do community detection as a reason for using graph over relational?
> For 99% of use-cases a relational database is the preferred solution…
After enough years you realize this is the case for every single problem
"After trillions spent in GPUs and data centers, the AI gold rush was finally over when a developer in Lithuania built the pg_thinking plugin - turns out postgres was all you needed all along."
I used a graph database in https://www.exploravention.com/products/askarch/ because software architects typically need to understand the dependencies of a complex software system before they can suitably lead work on that system. A dependency graph is a good data structure to use when reasoning about dependencies, and a graph database is a natural choice for capturing dependency graphs. See https://www.infoq.com/articles/architecting-rag-pipeline/ for more details on the architecture of this AI product. The graph database works very well for this use case.
If you are considering a graph database for AI-based search and you are not already familiar with graph database technology, then be advised that graph databases are not relational databases. If you cognitively model nodes = tables and edges = joins, you will be in for some nasty surprises. Expect some learning, and some unlearning, before proceeding with that choice.
I don't think the cognitive models are that distinct; it's just a different way in which relations are stored. In any case, not distinct enough to warrant 'unlearning' relational approaches. While I find graph-based approaches more natural for some problems, we can stretch the relational paradigm quite a bit.
what are these "nasty surprises"? they're really not that different
I made a half-hearted attempt while building my startup - http://www.socratify.com
The use case was to build a knowledge graph to drive recommendations for the next best thing the user should learn.
After a few weeks of getting frustrated I went back to good old Postgres and writing a few tools for agentic retrieval.
It seems the agents are smart enough to traverse a database in a graph like manner if you provide them with the right tooling and context
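For instance (a minimal sketch with a made-up nodes/edges schema - every name here is illustrative), the "right tooling" can be as simple as two parameterized queries exposed to the agent as tools: one to inspect a node, one to list where it can hop next.

-- Hypothetical schema; all names are assumptions.
CREATE TABLE nodes (id BIGINT PRIMARY KEY, kind TEXT, body JSONB);
CREATE TABLE edges (
    src BIGINT REFERENCES nodes(id),
    dst BIGINT REFERENCES nodes(id),
    rel TEXT,
    PRIMARY KEY (src, dst, rel)
);

-- Tool 1: inspect the node the agent is currently looking at.
SELECT kind, body FROM nodes WHERE id = $1;

-- Tool 2: list its outgoing edges so the agent can pick the next hop.
SELECT e.rel, n.id, n.kind
FROM edges e JOIN nodes n ON n.id = e.dst
WHERE e.src = $1;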
I am using Neo4j to build an equipment database; I also use MySQL in the same project to store transactional data. It took some time to figure out the right syntax for Spring Boot/Neo4j Cypher queries, but now it works OK. The reason I chose Neo4j? Because I wanted to play with it :). I can say it is more flexible than relational databases. I would like to continue using it, but you can't create multiple databases within a single instance. I guess it is possible to work around that by installing separate binaries, but I have not tried it yet.
They’re nice if you let an agent decide what memories to store (so, schemaless in a way). This is how SOTA memory works today. There was a paper about it last month.
Start with Postgres and scale later once you have a better idea of your access patterns. You will likely model your graph as entities and recursively walk the graph (most likely through your application).
If the goal is to maintain views over graphs and performance/scale matters, consider Feldera. We see folks use it for its ability to incrementally maintain recursive SQL views (disclaimer: I work there).
Agreed, Postgres and recursive CTEs will let you simulate graph traversal with the benefit of still having a Postgres db for everything else.
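A minimal sketch of that simulation, again assuming a hypothetical edges(src, dst) table: find everything reachable from a start node, carrying the path along to guard against cycles.

WITH RECURSIVE walk(node, path) AS (
    SELECT e.dst, ARRAY[e.src, e.dst]
    FROM edges e
    WHERE e.src = $1                       -- $1 = starting node
  UNION ALL
    SELECT e.dst, w.path || e.dst
    FROM walk w
    JOIN edges e ON e.src = w.node
    WHERE e.dst <> ALL(w.path)             -- skip nodes already on this path
)
SELECT DISTINCT node FROM walk;

(Postgres 14+ also has built-in SEARCH and CYCLE clauses for exactly this kind of bookkeeping.)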
The weakness of these systems is not the query language or the planner.
It's the lack of a fully developed storage engine that avoids vendor lock-in.
Apache GraphAr (incubating) is a step in this direction. But it's an import/export format. Not primary storage.
Unaware of this effort (which has roots in Chinese graph DBs), I wrote a competing proposal that's more aimed at graph DBs looking to disaggregate compute and storage.
https://adsharma.github.io/beating-the-CAP-theorem-for-graph...
Graph DBs model relationships between entities. Is that a useful property of your retrieval task? They won't magically make your retrieval better without additional work.
How are you evaluating your current retrieval? Can you get to the point where you can compare your current solution with a Graph based one?
A lot of the time when I've seen people reach for a graph DB, what they actually wanted/needed was re-ranking of results.
An aside, but the Director of ML at a company I worked for kept telling us "We need a Graph! We need a Graph!" and, when questioned about _why_, said it was because we could find the fastest routes between train stations (it was for a big train ticket retailer) - no matter how many times we told him we don't set the routes and timetables; they're set by National Rail.
I (and many others) left the company shortly after his arrival.
>> An aside but the Director of ML at a company I worked for kept telling us "We need a Graph! We need a Graph!"
It depends. Maybe they knew something that the team didn't but couldn't articulate it. Maybe it would have been great. Alternatively (and this seems to be a common tactic, unfortunately), they don't really know what they are doing but use the strategy of introducing a large, time-consuming change and promising incredible things once the change is complete. The longer the change takes, the better in this situation, as they can just chill while the change is taking place and polish the resume for the next gig if it doesn't work out. If they jump to a new job before failure is obvious, they can claim that they effected some large change at the previous company and repeat the process. The other strategy is to performatively claim success in the face of failure and move on to the next big thing.
I am partial to the approach in https://www.expasy.org/about-chat ;) so yes, I think it can help. Mostly, though, the use case becomes interesting when you deal with multiple graph databases, e.g. UniProt + WikiData etc.
If it is just to query one single dataset that is already in one tool it is less compelling.
I had one client whose whole thing was knowledge graphs, which I worked on because I needed money and it was interesting, but I am still a little suspicious that they may have had "knowledgebase" and "knowledge graphs" mixed up and did not know about vector search.
I think for the particular use case, something like filtering the vector search based on tags for each document and then (maybe) a relatively inexpensive and fast reranking LLM step could have worked as well or better. But the reranker is not necessarily important with a strong model and including enough results.
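A minimal sketch of that alternative, assuming Postgres with the pgvector extension and a hypothetical documents table (all names are made up): filter by tags, over-fetch by embedding distance, and optionally hand the top results to a reranking LLM.

-- pgvector provides the vector type and the <=> cosine-distance operator.
SELECT id, body
FROM documents
WHERE tags && $2               -- $2 = text[] of tags to filter on
ORDER BY embedding <=> $1      -- $1 = query embedding
LIMIT 50;                      -- include enough results; rerank afterwards if needed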
I have seen a number of knowledge graph MCPs and I am curious if anyone is using them to get decent results. On the surface it seems like it would be a good way to index internal or niche knowledge, but I have never met anyone who is actually using these tools in practice.
In my experience, people think they need a graph database to express a knowledge graph, but actually a document-oriented database with appropriate levels of embedded context and links to related context seems to work just fine.
I implemented a graph database for my agentic project in postgres. Not top-tier query performance but made an MVP work!
I really like using jazz.tools for those use cases. It lets me iterate 10x faster and I'm super flexible with the schema. Also, I can share the data with clients directly without the need for a specific API.
I think it is too difficult to manage, with little upside. Setting it up is easy, but managing the nodes etc. is a huge overhead.
https://github.com/microsoft/graphrag
This is not agentic, but it gave pretty good results when I did a PoC.
The shared knowledge base with my agent has some graph-like areas, some table-like areas, and some that seem to evolve back and forth.
So far, I'm using postgres and helping the agent develop MCP functions for it. As we find some optimizations (or at least, reliably performant daily routines), I might make the choice to represent some relationships in a graph database.
One thing I'm building that seems plausibly likely to eventually call for a graph DB is "The Oracle of Bluegrass Bacon" (like the Oracle of Kevin Bacon, but for string band pickers). And it's nice for the agent to have fairly optimal access to this data as we build other adjacent projects.
But yeah, so far just postgres.
The TerminusDB folks have been doing projects with LLMs for some years now; I'd assume they've had some success.
I don't remember if their licensing is annoying, but it was rather neat as a graph store when I tried it out last Christmas or thereabouts. If you actually have a fitting need, it's probably a decent option; there are some graphical interfaces and so on that people who aren't technical specialists can use.
https://terminusdb.org/
Graph RAG is great, but misunderstood.
As a human user, consider file-system navigation or code search. You navigate to a seed file, and then hop through the children and dependencies until you find what you're looking for. The value is in explicit and direct edges. Perfect search is hard. Landing in the right neighborhood is less so. (It's like golf if you think about it)
Agentic systems loop through the following steps: Plan -> Inquire -> Retrieve -> Observe -> Act -> Repeat. The agent interacts with your search system during the inquire and retrieve phases. In these phases, there are two semantic problems that a simple embedding-based search or a simple DB alone can't solve: seeding and completeness. Seeding: how do you ask a good question when you don't know what you don't know? Completeness: once you know a little bit, how do you know that you have obtained everything you need to answer the question?
A solid embedding based search allows under-defined free-form inquiry, and puts the user near the data they're looking for. From there, an explicit graph allows the agent to navigate through the edges until it hits gold or gives the agent enough signal to retry with better informed free-form inquiry. Together, they solve the seeding problem. Now, once you have found a few seed nodes to work off of, the agent can keep exploring the neighbors, until they become sufficiently irrelevant. At that threshold, the retrieval system can return the explored nodes with a measurable metric of confidence in completeness. This makes completeness a measure that you can optimize, helping solve the 2nd problem.
You'll notice that there is no magic here. The quality of your search will depend on the quality of your edges, entities, exploration strategy and relevance detectors. This requires a ton of hand-engineering and subject specific domain knowledge, neither of which are systems bottlenecks. The data-store itself will do very little to help get you a better answer.
Which brings me to your question, the datastore. The datastore only matters at sufficient scale. You CAN implement Graph RAG in a standard database. Get a column to track your edges, a column to track entities and some way to search over embeddings and you're good. You can get it done in an afternoon (until permissions become an issue, but I digress).
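A minimal sketch of that afternoon version, assuming Postgres with the pgvector extension (every table and column name below is made up):

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        BIGSERIAL PRIMARY KEY,
    body      TEXT,
    entities  TEXT[],                  -- the column tracking entities
    embedding VECTOR(1536)
);

CREATE TABLE chunk_edges (             -- the columns tracking your edges
    src BIGINT REFERENCES chunks(id),
    dst BIGINT REFERENCES chunks(id),
    rel TEXT,
    PRIMARY KEY (src, dst, rel)
);

-- Seeding: embedding search drops the agent in the right neighborhood.
SELECT id, body
FROM chunks
ORDER BY embedding <=> $1              -- $1 = query embedding
LIMIT 5;

-- Expansion: hop the explicit edges outward from a seed chunk.
SELECT c.id, e.rel, c.body
FROM chunk_edges e JOIN chunks c ON c.id = e.dst
WHERE e.src = $1;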
We know that a Spotlight-style file-system search works just fine on 100k+ documents while your Mac's fan barely even turns on. If you're asking this question, then your company probably doesn't scale past that point. In fact, I'd argue that few companies will ever cross that threshold for agentic operations. At this scale, your Postgres instance won't be the bottleneck.
Comparing Postgres to graph-RAG startups, the real value of a native graph-RAG solution is its defaults. These companies know that their users' need is agentic semantic search, and the products come preloaded with defaults that give you embeddings, entities and graph edges that aren't completely useless. From a practical standpoint, those extras might push you over the edge. But be aware that your performance gains come from outsourcing the hand-engineering of features, not from the data structure itself.
My personal opinion is to keep the data structure as simple as possible. MLEs and data scientists are mediocre systems engineers, and it is okay to accept that. You want your ML & product team to be able to iterate on the search logic and quality as fast as possible; that's where the real gains will come from. Speaking from experience, premature optimization in a new field will slow your team down to a crawl. I.e., go with Postgres if that's what's simple for everyone to work with.
tl;dr: it's not about the scalability of the data structure, it's about how you use it.
I've been asking everyone: where can I find examples of 'agentic use cases', and how does one define agentic workflows?
"agentic" just means "we hooked up the LLM's responses to some code that has some side-effect in the world". Like "search the web" or "create a file".