There are two ways to query a relational database for information. The first is to expose the database's SQL scripting interface directly to the user, who then has complete access to all objects and operations the database allows. Such interfaces are usually reserved for database administrators who thoroughly understand the SQL language and the structure of the database in question. The results usually have poor display quality. This approach is not suitable for a product aimed at ordinary users with limited knowledge of databases and of a system's low-level data structures, and it also opens the system wide to security attacks, information leaks, and similar risks.
The second is to encapsulate all queries and other actions into a predefined set of fixed methods invoked through the user interface. This is easier for a user and, if designed properly, much more secure, but it is limited in functionality because every operation is predefined by the producer. Developing a user interface that covers even a small portion of the possible operations on a database requires a rapidly growing amount of effort, especially when the database contains many tables. The result is still a non-extensible, limited system, fixed from the very beginning no matter how many resources the producer has or how hard it tries. Most data applications on the market today are produced this way, supported by economically reasonable resources. Going further, the producer must justify not only the development costs but also the maintenance costs, and even the feasibility of the system given its increased complexity.
The second approach increasingly relies on some form of object-relational mapping (ORM), especially for sufficiently complex applications. Most of the successful programming languages used to build user-interface applications are object-oriented (OO). One important strength of OO languages is polymorphism: related types or concepts can be represented not only as concrete classes but also as abstractions, so that a set of more abstract interfaces can be declared for them. Applied well, this significantly increases simplicity of coding style, extensibility, the degree of code reuse, and the ease of project maintenance. Data inside a relational database are organized differently. ORM happens at many levels in an application system; here we are interested in the data level. Suppose one makes the following mapping for comparison purposes:
A table schema in the RDB world maps to a type in the OO world;
A table row in the RDB world maps to an object in the OO world;
A row field in the RDB world maps to a property of an object in the OO world.
One can then see that there is no concept of polymorphism in the RDB world. Instead of inheritance, an RDB developer can use composition to reach similar goals, which tends to be less elegant than the OO approach.
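The composition workaround can be sketched with a minimal in-memory example; the animal/cat tables and their columns below are hypothetical illustrations, not the schema of any real system.

```python
import sqlite3

# Composition in the RDB world standing in for "Cat inherits Animal".
# Table and column names are hypothetical illustrations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE animal (id INTEGER PRIMARY KEY, name TEXT, legs INTEGER);
    CREATE TABLE cat (animal_id INTEGER PRIMARY KEY REFERENCES animal(id),
                      whiskers INTEGER);
""")
conn.execute("INSERT INTO animal VALUES (1, 'Tom', 4)")
conn.execute("INSERT INTO cat VALUES (1, 24)")

# The "is-a" relationship must be recovered by an explicit join:
row = conn.execute(
    "SELECT a.name, c.whiskers FROM animal a JOIN cat c ON c.animal_id = a.id"
).fetchone()
print(row)  # ('Tom', 24)
```

Note that nothing in the schema itself records that a cat "is" an animal; the relationship lives only in the join the query author must remember to write.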
Let’s focus on the design of a hypothetical all-inclusive database of the animals on Earth. Reasoning such as “a cat is an animal”, so that “any action that can be performed on an animal can also be attempted on a cat (pending approval from the owner, the cat, a cat-rights organization, or whatever), with the result depending on the cat-ness of said animal”, cannot be represented well in an RDB system. A query like “select (d as cat).* from [the animal set] where d is cat AND (d as cat).Whiskers<2 …”, where “whiskers” is a property belonging only to the cat family, has no support to be executed. Each concrete member of the animal family must have a separate, predefined table, and the tables cannot be mixed or inherited. Given the diversity of the animal family, such an animal database becomes very hard to maintain and expensive to develop, if it is possible at all. The abstract source, “[the animal set]”, would keep the design simple and extensible as the database grows, provided the underlying database could be implemented.
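In the OO world the query above is straightforward to express. The following minimal sketch, with hypothetical Animal/Cat classes and a whiskers property, shows the type-narrowing that the RDB side lacks:

```python
# Sketch: the query "select (d as cat).* from [the animal set]
# where d is cat AND (d as cat).Whiskers < 2" over plain objects.
# Class names and the whiskers property are hypothetical.
class Animal:
    def __init__(self, name):
        self.name = name

class Cat(Animal):
    def __init__(self, name, whiskers):
        super().__init__(name)
        self.whiskers = whiskers

class Dog(Animal):
    pass

animals = [Cat("Tom", 1), Cat("Felix", 24), Dog("Rex")]

# "d is cat AND (d as cat).whiskers < 2" becomes an isinstance check
# plus an attribute access valid only on the narrowed type:
result = [d.name for d in animals if isinstance(d, Cat) and d.whiskers < 2]
print(result)  # ['Tom']
```

The heterogeneous list plays the role of “[the animal set]”: new Animal subclasses can be added later without changing the query.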
The reader may think this is exactly what an object-oriented database (OODB) can do. Depending on the user's needs, this may be true; what we are after lies in some critical details. The concept of an OODB is of course not new, and many implementations already exist. One quality that differentiates them is how easily the database can be queried. Second, OODB data structures, being tightly bound to OO languages, tend to be hierarchical rather than relational. Data in an RDB map to graphs in the OO world, a fact many OO developers do not use explicitly. This state of affairs shows in the fact that even some serialization libraries built into language frameworks cannot faithfully preserve an object graph without polymorphism, reflecting the lack of care for the relational aspects of things in the object world. When polymorphism is included, almost none of them can do it naturally yet. The more relational structure such systems can represent, the more they reflect the fact that much real-world data cannot be fitted naturally into hierarchical structures. That is one of the reasons why databases went in the relational direction and became so spectacularly successful at the low data layer. SQL combined with the relational data model is also among the most successful data systems, one that most existing OODBs cannot yet match. It would be ideal to develop a data system that is a hybrid of the relational and the hierarchical, and to implement an (at least partial) extension of SQL that allows detailed queries like the one above to be executed dynamically in response to a client's (or user's) runtime input. In other words, it must be relational (in addition to its natural hierarchical quality) and scriptable by end users. This is one of the problems we have successfully addressed in our technology; it is detailed below.
Languages in general have syntax rules that define correct sentences. For a language to be useful, the space of its member sentences must be large enough to describe the target objects or concepts.
For the sake of the following discussion (not a scholarly one, which this white paper has no intention of entering), let us take the duality point of view that the syntax of a language is a knowledge encoder: it condenses a certain body of pre-existing, general knowledge into rules for constructing correct sentences in the language and for eliminating the rest, which are far larger in number, as nonsense, noise, or errors. Natural languages evolved to be large enough to describe all human affairs. They became so large that only what Kant called a priori knowledge (some philosophers call it "justified true belief independent of experience") is encoded into the syntax when formalized. In this view, syntax-encoded knowledge functions as rules, axioms, or beliefs rather than reflecting particular material facts, since any material fact has to be describable by them.
Computer languages come in at least two kinds: general-purpose languages and domain-specific languages.
A general-purpose computer language must describe how to control the operation of a Turing machine, an abstract machine that is nevertheless physically realizable. Sentences in a general-purpose language produced by a human are therefore restricted both by the a priori knowledge mentioned above and by the fact that they must also be understood by a Turing machine. Knowledge of how a Turing machine works is thus encoded into every general-purpose computer language.
Domain-specific languages (DSLs) describe specific targets inside a specific knowledge domain, one understood well enough for its context boundary to be clearly and formally defined. They have more knowledge built in as rules, gained in domain-knowledge acquisition processes already completed. Sentences that are otherwise correct in general but do not refer to domain targets, or that contradict established domain knowledge, are treated as nonsense, noise, or errors. From this point of view, domain languages can form hierarchies from the more general to the more specific, with general-purpose languages as roots. A DSL with an interpreter or compiler implemented can simplify the programming task significantly because it reduces the information a user must supply: information derivable from domain knowledge is already assumed to be true.
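As an illustration of how built-in domain knowledge reduces the information a user must supply, here is a toy filter DSL; the field names, operators, and grammar are invented for the example and belong to no real language.

```python
# Toy DSL whose interpreter encodes domain knowledge: only fields known
# to the (hypothetical) domain may appear, so a sentence like "mass > 10"
# is rejected as nonsense at parse time, before touching any data.
DOMAIN_FIELDS = {"age", "height"}

def parse(sentence):
    field, op, value = sentence.split()
    if field not in DOMAIN_FIELDS:
        raise ValueError(f"unknown domain field: {field}")
    if op not in {"<", ">", "="}:
        raise ValueError(f"unknown operator: {op}")
    return field, op, float(value)

def run(sentence, records):
    field, op, value = parse(sentence)
    ops = {"<": lambda a, b: a < b, ">": lambda a, b: a > b,
           "=": lambda a, b: a == b}
    return [r for r in records if ops[op](r[field], value)]

people = [{"age": 25, "height": 1.8}, {"age": 40, "height": 1.6}]
print(run("age > 30", people))   # matches only the second record
try:
    run("mass > 10", people)     # rejected: 'mass' is outside the domain
except ValueError as exc:
    print(exc)
```

A user writes three tokens instead of a full filtering program; everything else is supplied by the domain knowledge baked into the interpreter.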
DSLs are used widely in software applications, including but not limited to: graph-scripting languages for controlling plots and figures, such as Graphviz and gnuplot; JavaScript for manipulating the HTML document model; SQL for querying relational data; MATLAB for technical computing; etc.
What does all this have to do with the present white paper, you may ask? A lot, is the answer.
The meta-languages introduced in this white paper are languages used to describe and/or operate on other languages. More specifically, they concern code-generation systems, each of which can be specified by a DSL belonging to the meta-language family. The meta-languages referred to here are hierarchical, running from more general ones, originating in the initial design of the system, to more specific ones into which further user specifications are encoded, along the software-production pipeline. The more general meta-languages act, at least in part, to generate the next, more specific level. In the duality point of view mentioned above, each level of meta-language inherits the knowledge or specifications of its parent, resulting in unlimited variation in the end products at each level.
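A minimal sketch of such a two-level pipeline, under the assumption that each level emits the source code of the next; all names and the template format are invented for illustration.

```python
# Sketch: a level-1 (more general) generator emits the source of a
# level-2 (more specific) generator, which emits the final code.
# The entity/field names and template are hypothetical.
GENERAL_TEMPLATE = '''
def make_accessor(field):
    # Level-2 generator: specialized for the '{entity}' entity,
    # inheriting that knowledge from the level-1 generator.
    return "def get_{entity}_" + field + "(obj): return obj['" + field + "']"
'''

def level1_generate(entity):
    """Level-1 generator: emits the source of a level-2 generator."""
    return GENERAL_TEMPLATE.format(entity=entity)

ns = {}
exec(level1_generate("cat"), ns)              # build the level-2 generator
final_code = ns["make_accessor"]("whiskers")  # level-2 emits end-product code
exec(final_code, ns)
print(ns["get_cat_whiskers"]({"whiskers": 24}))  # 24
```

The level-2 generator never has to be told about "cat" again: that knowledge was inherited from its parent, exactly the inheritance of specifications described above.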
Using a meta-language-driven code-generation system in the software-production process has many benefits, including but not limited to: 1) quality assurance and increased code reusability; 2) consistent requirement interpretation and implementation, now encoded as a meta-language at a certain level; 3) repeatability in an iterative production process and in post-production maintenance; 4) reduced production cost; 5) much richer information content in the end products, inherited from the levels of meta-languages that produced them, which provides a basis for increased intelligence; 6) easier end-user participation in customizing the end products; etc. The potential is endless.
But good things do come at a cost: building and using meta-languages is not an easy area to enter, at least for the majority of developers. Developing hierarchical meta-languages requires computer-assisted tools (i.e., code generators), especially for the non-top levels, because the task is hard to accomplish manually in real production cases. Currently, very little published information exists on meta-language-driven code generation. This is evidently an advanced and highly technical area; many software producers may use such techniques in one way or another in their automation pipelines, but mostly as supplementary parts or steps, not as something that "drives" the process.
SQL, by contrast, is not a meta-language but a DSL for querying relational data sets. SQL can be applied to any relational data. To apply it to a particular database instance, users must know the structure of that database and put this knowledge into the expressions (sentences) targeting it. They can misuse that knowledge, for example by misspelling a table name. The mistake will not be found until the expression is executed by the database engine at run time, which may have unwanted effects, possibly even disastrous ones.
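The runtime-only detection of such mistakes can be seen in a minimal example (the one-table schema is hypothetical):

```python
import sqlite3

# A hypothetical one-table schema; the typo below is caught only when
# the SQL engine actually executes the expression.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")

try:
    conn.execute("SELECT name FROM custmoers")  # typo for 'customers'
except sqlite3.OperationalError as exc:
    print(exc)  # no such table: custmoers
```

The expression is syntactically valid SQL; nothing short of executing it against this particular instance reveals the error.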
Following the discussion above, one can see that DSLs can also have multiple levels. Now suppose one develops a sub-class of SQL that targets a specific database instance. The structural knowledge can then be built into the syntax of the language and enforced by the code "generator", the "compiler", or the "interpreter" of the language, and the potential for disaster can be systematically eliminated from the very beginning. The system can also run faster, since no time has to be wasted at run time checking that a query expression conforms to the underlying structure of the target database instance, e.g., verifying that a named table actually exists.
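One hedged sketch of the idea: a hypothetical generator could read the schema and emit one query helper per existing table, so a misspelled table name fails before any SQL reaches the engine. The schema and the emitted API below are invented for illustration.

```python
# Hypothetical schema of the target database instance, known at
# generation time; table and column names are invented.
SCHEMA = {"customers": ["id", "name"], "orders": ["id", "customer_id"]}

def generate_query_api(schema):
    """Emit an object with one select method per existing table."""
    class QueryAPI:
        pass
    for table, columns in schema.items():
        def select(self, cols="*", _t=table, _c=tuple(columns)):
            wanted = [] if cols == "*" else [c.strip() for c in cols.split(",")]
            bad = [c for c in wanted if c not in _c]
            if bad:
                raise ValueError(f"unknown columns {bad} in table {_t}")
            return f"SELECT {cols} FROM {_t}"
        setattr(QueryAPI, f"select_{table}", select)
    return QueryAPI()

db = generate_query_api(SCHEMA)
print(db.select_customers("name"))  # SELECT name FROM customers
# A typo such as db.select_custmoers(...) raises AttributeError
# immediately, before any SQL reaches the database engine.
```

In a real pipeline the generator would rerun whenever the schema evolves, keeping the emitted language synchronized with the database instance.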
Experienced developers may think that developing a general language like SQL is already hard enough (the more generic the language, the less information it contains, and therefore the simpler producing a compiler becomes with regard to the information-encoding aspect; of course, general languages are still usually harder to produce overall, since they tend to be finer-grained in implementation and must cover far more application scenarios than specialized ones), and that producing and maintaining a DSL that must stay synchronized with an evolving database instance is not worth doing, especially when the instance has many tens or hundreds of tables. The enormous information-encoding and synchronization task could easily overwhelm even a talented developer.
This is indeed the case with the tools currently available on the market. In many senses, it is no longer true in our system.
Languages supporting the OO style of programming normally have rich support for a type system, and often for generics as well. When developing a particular piece of software, the type system is used to encode some of the domain knowledge mentioned above that the application targets; the compiler can then perform compile-time conformance checking on the code to eliminate certain kinds of errors. OO programming supported by a well-designed type system and generics has many advantages from a software-engineering point of view, too many to list here. In addition to those mentioned in the "Databases" section, a short list would include: helping humans understand and maintain code and organizational intellectual knowledge, and helping the compiler eliminate errors. For the computer, however, it is in most cases a burden to execute, for example when resolving concrete types and virtual methods. That is why most higher-level languages, such as C++, are slower than their lower-level counterparts, such as C. Moreover, for the same set of functionality, executables and libraries produced by higher-level languages are much larger than those produced by their lower-level counterparts, partly because type meta-information needed by humans is in some way reproduced in the output binaries.
This situation can be improved under the duality point of view and the practices described above. Here the code generators (scripts) and the corresponding code-generation meta-languages together maintain, at least partially, the meta-information that would otherwise be encoded into the software's type system. The information encoded into the code generators is never explicitly output into the final binaries. Applied correctly, this results in a smaller executable footprint (for the same functionality and information content), higher execution speed, and a reduced chance of random errors when encoding domain knowledge in a repetitive or iterative development process. In a closed-source system, it also provides a way for a developer to manage its intellectual property.
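As a rough illustration of meta-information living only in the generator, consider this sketch: the field list is consumed at generation time, and the emitted function carries no schema metadata. The record layout is hypothetical.

```python
# Sketch: the field list exists only in the generator; the emitted
# function has the names inlined and ships no schema metadata.
# The record layout ('x', 'y') is a hypothetical example.
FIELDS = ["x", "y"]  # meta-information, known only at generation time

def generate_serializer(fields):
    body = " + ',' + ".join(f"str(obj['{f}'])" for f in fields)
    return f"def serialize(obj): return {body}"

source = generate_serializer(FIELDS)
ns = {}
exec(source, ns)
print(ns["serialize"]({"x": 1, "y": 2}))  # 1,2
```

The generated source contains no reference to FIELDS or to any schema object; in a compiled setting, the analogous generated code would carry none of the reflection tables that a type-system-based implementation would.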
Little public information on similar practices can be found so far. Perhaps we are the only ones doing this at present.
Code generators of the first kind are passive: they guide a user in generating syntactically correct sentences of the current level of language. Code generators of the second kind are active: driven by the output of the first kind, they generate software components.
IntelliSense and auto-completion technologies are widely used in integrated development environments (IDEs), with quality that differs from one vendor to another. A code generator of the first kind plays a similar role, but it is called a generator rather than an aid because code generated by randomly selecting from its completion options cannot contain syntax errors, and it must be able to generate all possible sentences in the language, no more and no less. Most IntelliSense or auto-completion schemes cannot reach that precision at present.
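A toy sketch of a generator of the first kind, over an invented three-rule grammar: at each step it offers exactly the continuations the grammar allows, so any random walk through the options yields a well-formed sentence.

```python
import random
import re

# An invented toy grammar; placeholders in angle brackets are the
# points where the passive generator offers completion options.
GRAMMAR = {
    "START":    ["SELECT <column> FROM <table>"],
    "<column>": ["name", "whiskers"],
    "<table>":  ["animal", "cat"],
}

def complete(sentence):
    """Return exactly the valid fills for the next placeholder, if any."""
    m = re.search(r"<\w+>", sentence)
    return GRAMMAR[m.group()] if m else []

def generate(rng):
    """A random walk over the offered options always yields a valid sentence."""
    sentence = GRAMMAR["START"][0]
    while (m := re.search(r"<\w+>", sentence)):
        sentence = sentence.replace(m.group(), rng.choice(GRAMMAR[m.group()]), 1)
    return sentence

print(generate(random.Random(0)))  # a well-formed query, e.g. "SELECT name FROM cat"
```

The two properties claimed above hold by construction: `complete` offers all and only the grammatical continuations, and `generate` can reach every sentence of the language.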
We have made significant progress on this front in various forms. The code generator for the sub-SQL DSL (see above) appears in our end products, such as the file-system database introduced here, where it can intelligently guide an ordinary user through complex queries, either visually or in a text editor.
Code generators of the second kind are interpreters of the corresponding meta-language. They are auto-generated by meta-languages at a higher level, written manually, or a mixture of both. Code generators are not rare on the market, but most are based on visual user interfaces. We are not aware of any that are driven by a meta-language whose scripts are generated by the corresponding generator of the first kind.
"Sqlization" is not (yet) an English word; we use it to denote the process of turning unstructured but correlated data sets into a relational database against which SQL-like queries can be performed. "Virtual" means that during the sqlization process the underlying data are neither copied nor altered and the database tables are virtual (i.e., non-existing), so the database instance is actually a virtual one. The system builds the virtual database at startup, using the current snapshot of the file system as input, then listens for changes within at least part of the file system and updates its content in response.
Here the data sets are unstructured in the relational sense; they may well have other structures that are not, or cannot be, directly used by the process described here. For example, the data may be a document with rich internal structure, or a web-server log entry containing field items that can be categorized into different, but related, virtual data tables.
Sqlization requires the creation of an artificial relational database schema for the target data system, one that can be mapped to the original data schema of that system, depending on what is expected of the sqlization. Most likely it will contain more information (or potential, or space) than the original, so that extra data correlations can be built and recorded into the new database, making the sqlized system richer in content and information.
This virtualization has many benefits. First, the target data sets may be very large and change constantly, like a user's file system inside his or her operating system; it is not realistic to copy and transform them all into a concrete relational database instance and keep the two copies synchronized. Second, there are cases in which the sources of the data sets can be redirected, merged, or disconnected. Virtualization creates a layer of data indirection that can be programmed to redirect to, merge with, or disconnect from data sources either at the configuration level or even at run time, making the system expandable.
Obviously, sqlization of data sets has existed since the advent of relational databases, because data have to be structurally formatted and written into a database somehow. It is how they are written into the database that makes the difference. The current state of sqlization technology is too broad a topic to cover here, since there are unlimited kinds of data systems. Let us concentrate on the file system from now on.
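As a rough, snapshot-style illustration (not the virtual mechanism itself, which would resolve queries against the live file system without copying anything), a directory tree can be sqlized into a hypothetical files table and then queried with ordinary SQL:

```python
import os
import sqlite3
import tempfile

# Snapshot-style sqlization: one hypothetical 'files' table filled from
# a single os.walk pass. A virtual implementation would instead answer
# queries from the live file system rather than from a copied snapshot.
def sqlize(root):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (path TEXT, name TEXT, size INTEGER)")
    for dirpath, _dirs, names in os.walk(root):
        for n in names:
            full = os.path.join(dirpath, n)
            conn.execute("INSERT INTO files VALUES (?, ?, ?)",
                         (full, n, os.path.getsize(full)))
    return conn

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "a.txt"), "w") as f:
        f.write("hello")
    with open(os.path.join(root, "b.log"), "w") as f:
        f.write("x")
    conn = sqlize(root)
    rows = conn.execute(
        "SELECT name, size FROM files WHERE name LIKE '%.txt'").fetchall()
    print(rows)  # [('a.txt', 5)]
```

Even this crude snapshot makes relational queries over file metadata possible; the virtual version described above removes the copying and keeps the answers current.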
There are many database file systems on the market. These are file systems built on top of a relational database: the original data (the files) are stored in the database, so the file system is backed by the relational database. Sqlization means something different: turning an existing data system into a relational database instance. A few products of this kind exist on the market.
At present there is little published information on the virtual sqlization of file systems, or of any other data system.