In something like prolog, terms can be nested. So I can express the idea "Socrates believes that he is mortal" with bel ieves(socrates, mortal(socrates)).
Are there any popular database types that make it easy to have relations of arbitrary arity as parameters of other relations, without unduly adding to the complexity of the schema? Preferably with ungrounded statements and open-world negation? Is there some obvious reason why not? Is there a computational complexity problem that arises in the real world? RDF allows triples to be referenced, I believe, but you are limited to arity 2, which seems needlessly limiting. Labeled graphs have arbitrary arity for non-entities, but entities are limited to two, and you usually can't refer to an edge. It seems... weird to me. Is it just that we don't really have the efficient reasoners over those kinds of expressions, so it hasn't been useful?
Sounds like a "nested hypergraph" (which I have never used before): arxiv.org/abs/2405.12235
đź“ť Hypergraph: A Unified and Uniform Definition with Application to Chemical Hypergraph
The conventional definition of hypergraph has two major issues: (1) there is not a standard definition of directed hypergraph and (2) there is not a formal definition of nested hypergraph. To resolve these issues, we propose a new definition of hypergraph that unifies the concepts of undirected, directed and nested hypergraphs, and that is uniform in using hyperedge as a single construct for representing high-order correlations among things, i.e., nodes and hyperedges. Specifically, we define a hyperedge to be a simple hyperedge, a nesting hyperedge, or a directed hyperedge. With this new definition, a hypergraph is nested if it has nesting hyperedge(s), and is directed if it has directed hyperedge(s). Otherwise, a hypergraph is a simple hypergraph. The uniformity and power of this new definition, with visualization, should facilitate the use of hypergraph for representing (hierarchical) high-order correlations in general and chemical systems in particular. Graph has been widely used as a mathematical structure for machine learning on molecular structures and 3D molecular geometries. However, graph has a major limitation: it can represent only pairwise correlations between nodes. Hypergraph extends graph with high-order correlations among nodes. This extension is significant or essential for machine learning on chemical systems. For molecules, this is significant as it allows the direct, explicit representation of multicenter bonds and molecular substructures. For chemical reactions, this is essential since most chemical reactions involve multiple participants. We propose the use of chemical hypergraph, a multilevel hypergraph with simple, nesting and directed hyperedges, as a single mathematical structure for representing chemical systems. We apply the new definition of hypergraph to chemical hypergraph and, as simplified versions, molecular hypergraph and chemical reaction hypergraph.
Might want to check out hypergraphdb.org
The unit of storage is a tuple made up of 0 or more other tuples.
This too seems similar: wiki.opencog.org/w/AtomSpace
It hints at one of the complexities involved. Some atoms become executable. For example: Queries themselves are graphs. Atomese language is Turing-complete.
I've looked at nested hypergraphs before, but atomspace is interesting, thanks!
Can't you do this just with a unique ID/foreign key...? BELIEVES is a relation that takes a tuple as the 2nd parameter.
(SOCRATES, IS, MORTAL), 123
(SOCRATES, BELIEVES, 123), 124
RDF* seems to do it that way, putting names on subgraphs and referencing those.
@Chris, I'm not sure how that representation would distinguish between asserting the fact itself, and merely asserting that the fact is believed. Statement 124 should be possible to express even if statement 124 is not known to be true.
Yeah you would have to separately distinguish which statements are TRUE. This would just be a particular relation though? It's no different from which statements are FUNNY or INTERESTING
It is slightly different, because how do you know that 461 is true?
Yeah you're right in that if you want to do some kind of meta level inference there's a special kind of "true" which is USER thinks X is true
But you presumably want to preserve the ability to say that Socrates believes 1 = 2 is true etc though
yes, but I am not sure it has to look the same as "these are the things the system should believe to be true"
If I have to look to see if the statement exists, and then also check if it is listed as true, that's a lot of needless cognitive overhead for the human user. And as the nesting gets deeper, reading statements is going to get harder and harder.
The reason the relational model forbids this — the reason it is restricted to first order logic — is that this runs into the halting problem.
Roughly, once relations can refer to other relations, they can form arbitrarily complex structures where queries never finish. The relational model is, almost mathematically so, the richest model you can have where queries always finish.
Note that in the above suggestions where you refer to other relations by a code, you have to go into your external, Turing-complete language to use this.
Not a reason a database should not have this feature, of course, but this might be a useful nuance to be aware of.
Every programming language runs into the halting problem, and yet, we don't care. Why do we care for a knowledge language?
SQL has poisoned the entire industry. So you’re used to thinking of a relational database as this heavyweight, fixed sort of thing with a terrible query language.
The correct way to understand the relational model is that it lets you separate your first order logic (FOL) computation from the stuff that should be (either logically or pragmatically) done in a Turing complete language.
The vast majority of the logic in just about any program can be written in FOL. Folks don’t even think about doing that because SQL is terrible.
One of the great features of the relational model is that precisely because of its limits, optimisation is a relatively simple problem. We should be able to use relations freely across all of our programming tasks. The result would be that you can just declaratively specify what comes from what, and the query optimiser will make it work efficiently.
These considerations still apply to SQL databases. If you execute a query, it will eventually finish and when it does, you’re guaranteed that if some data satisfies your query, it will be retrieved.
If we use a database that is Turing complete, you can write something that is logically correct, but the query engine might not be able to determine that. It is a great virtue that if something is true of your data, if you can express it in FOL, your database can find it.
I have difficulty gauging how practical a virtue the guarantee of termination really is. I can end up with non-terminating code elsewhere, and I just fix it. Is the benefit perhaps that someone cannot add data that makes a query stop terminating that used to terminate? Because not a lot of people are letting users run arbitrary untested queries against their data. So it feels like you would know in advance it will terminate unless the data can stop that from happening, maybe?
The relational model is, basically, “essence of declarative programming”. 2nd order logic, subject to halting problems, currently mostly needs to be written by humans. But the rest of it can be declarative — most of most programs, in fact. The part you can express in FOL benefits from this essence of declarative stuff insofar as you can just declare the rules of what facts follow from what other facts, and then the computer writes an efficient algorithm to implement the logic for you.
The relational model provides a theory for generating code from declarations about what is true and what follows from what.
Don't get me wrong. I'm sold on declarative logic code. Love it, terminating issues and all. Big fan. I just am not sold on the FOL limitation, and I'm having difficulty believing that anyone without a comp sci degree really cares about guaranteed termination that much, unless it is guaranteed against changes in the data.
But that right there is one reason for the design. If you have a language in which someone ~can’t express~ a non-terminating query, they don’t have to understand any of that, but they know they have a completely reliable system that always answers their questions.
I’m curious too. If every query has a 60-second deadline, valid FOL and SOL queries will both sometimes fail. How much does it matter why? Are the performance problems harder to diagnose and fix with the more complicated languages? Are they more common?
I was playing with the idea of a query language the other day and realized that I was limiting myself to FOL. So I questioned the choice.
I came to realize that apps will need additional flexibility in how they process data before it can be presented, and limiting the query language wasn’t going to make that requirement go away. All I was doing was forcing algorithms to be split across database and client, instead of allowing the entire algorithm to be done in one language. Worse, if the seam is in the middle of an algorithm, I might be forcing multiple network calls, with all the de/serialization that entails, to satisfy one semantic query. For what? So that I can say that every operation my database supports is fast? Wouldn’t I rather say that my database is easy to use?
I decided that I’d rather bring relational operations to a more complex language, and just isolate them enough that they can be plucked out by an optimizer.
Aha! You discovered the weakness in my cunning ruse!
Not really. You can certainly write SQL queries that will time out. But in practice, those are rare, and if you had to solve the same problem in a Turing-complete language, there’s a good chance you have no other way to fix it.
Computation complexity is always with us, either way.
And in practice, there might be some things that are easier to write in a TC language. But I’d have to struggle with some quite unusual sort of data for a while to produce an example.
So: a relational database can represent most of most programs more cleanly and simply than any mainstream TC language can, and it does a lot of the programming for you, particularly in terms of working out what order it should process each part of your logic in. And, because it can do that dynamically and automatically, it can retain efficiencies even as you change the model, whereas the typical TC solution might need substantial re-engineering to retain efficiency.
There is absolutely nothing wrong with a higher-order/TC query/logic language. Prolog is one such, is very useful for some types of problems, and is I believe easier for non-developers to learn than most TC languages. Mercury is another such, not so beginner friendly but very expressive and flexible.
Codd’s original conception was that the relational database engine would be used in combination with a TC language. Predicates for filters, for example, are not part of the relational theory but are just supplied from a TC language.
So your idea for a language that contains both but separates them is not terrible. Although even with SQL, you can see the advantage of separating the relational side. One of the most underappreciated advantage of a separate relational storage engine is that you can use the same data store from multiple programs, written in different languages.
SQLite is interesting, in that actually leans a fair way into injecting the TC language. The features are not as well known as they should be, and depend on whether the language embedding supports it, but in SQLite, the host application/language can potentially define:
- regular functions;
- aggregate functions; and;
- (what most folks don’t know) custom type deserialisers.
Postgres has close similar features also.
The fact that all queries terminate isn't the actual benefit. As a result of that fact, the user can alter the data arbitrarily, and there is no possibility that by doing so they cause queries to become non-terminating. That's the actual upside. The user can't break it. If you are using a higher-order system, you have to either a) be careful not to use potentially infinite structures in your queries (in which case why not just use FPL), or b) guard against the user adding data that causes infinite loops. For my use case, b) was already taken care of in the context of the limited queries I have. Thanks for the enlightening chat!
You certainly need TC behaviour to be available. But separating what you can into the relational model has enormous benefits in terms of simplicity and flexibility.