| A database is an information set with a regular structure.
Any set of information may be called a database. Nevertheless, the term was invented to refer to computerised data, and is
used almost exclusively in computing. Sometimes it is used to refer to not yet computerised data, but usually in the process of
planning its possible computerisation.
Software created to manage generalised databases is usually called a database management system or DBMS. Several software architectures are possible: For smaller
single user databases often all functions are managed by one program. In larger and multiple user databases usually a number of
programs are involved and most commonly a client-server architecture is
adopted.
The DBMS front-end (i.e., the clients) is concerned mainly with data entry,
enquiry and reporting. The back-end (i.e. the server) is the set of
programs that actually control data storage, responding to requests from the frontend. Searching and sorting is usually performed
by the server. There are a wide variety of database implementations, from simple tables stored in a single file each to very
large databases with many millions of records, stored in rooms full of disk drives or other peripheral electronic storage
devices.
Databases resembling modern versions were first developed in the 1960s. A pioneer in the field was Charles Bachman. Two
key data models arose at this time: the network model (developed by CODASYL) followed by the hierarchical model (as implemented in IMS). These were later usurped by the relational model,
which was contemporary with the so-called flat model designed
for very small tasks. Another contemporary of the relational model
is the object-oriented database (OODB).
While the relational model is based on set theory, one proposed modification suggests fuzzy set theory (based on fuzzy logic) as an alternative.
Database Models
Various techniques are used to model data structure. Certain models are more easily implemented by some types of database
management systems than others. For any one logical model various physical implementations may be possible. An example of this is
the relational model: In larger systems the physical implementation often has indexes which point to the data - this is similar
to some aspects of common implementations of the network model. But in small relational databases the data is often stored in a
set of files, one per table, in a flat, unindexed structure. There is some confusion below and elsewhere in this article as to
logical data model vs its physical implementation.
The flat (or table) model consists of a single,
two-dimensional array of data elements, where all members of a given column are assumed to
be similar values, and all members of a row are assumed to be related to one another. For instance, columns for name and password
might be used as a part of a system security database. Each row would have the specific password associated with a specific user.
Columns of the table often have a type associated with them, defining them as character data, date or time information, integers,
or floating point numbers. This model is the basis of the spreadsheet.
The network model allows multiple datasets to be used together
through the use of pointers (or references). Some columns contain pointers to different tables instead of data. Thus, the tables
are related by references, which can be viewed as a network structure. A particular subset of the network model, the hierarchical model, limits the relationships to a tree structure,
instead of the more general directed graph structure implied by the full
network model.
The relational model was introduced in an academic
paper (http://www.acm.org/classics/nov95/toc.html) by E. F. Codd in 1970 as a way to make database
management systems more independent of any particular application. It is a mathematical model defined in terms of predicate logic and set
theory.
Although the basic idea of relational database management systems has been very popular, relatively few people understand the
mathematical definition and only a few, obscure DBMSs implement it completely and without extension. Oracle, for example, can be used in a purely relational way, but it also
allows tables to be defined which allow duplicate rows -- an extension upon (or violation of) the relational model. In common
English usage, a DBMS is called relational if it supports relational operations, regardless of whether it enforces strict
adherence to the relational model. The following is an informal, non-technical explanation of how "relational" database
management systems commonly work.
A relational database contains multiple tables, each similar to the one in the "flat" database model. However, unlike network
databases, the tables are not linked by pointers. Instead, "keys" are used to match up rows of data in different tables. A key is
just one or more columns in a table. Any of the columns in a table can be a key, or multiple columns can be grouped together into
a single key. Unlike pointers, it's not necessary to define all the keys in advance; a column can be used as a key even if it
wasn't originally intended to be one.
When a key consists of data that has an external, real-world meaning (such as a person's name, a book's ISBN, or a car's serial number), it's called a "natural" key. If no natural key is suitable, an arbitrary key
can be assigned (such as by giving employees ID numbers). In practice, most databases have both generated and natural keys,
because generated keys can be used internally to create links between rows that can't break, while natural keys can be used, less
reliably, for searches and for integration with other databases. (For example, records in two independently developed databases
could be matched up by social security number, except when the social security numbers are incorrect, missing, or have
changed.)
You request data from a relational database by sending it a query that's written in a special language, usually a dialect of
SQL. Although SQL was originally intended for end-users, it's much more common for SQL
queries to be embedded into software that provides an easier user interface. (Many web sites perform SQL queries when generating
pages.)
In response to a query, the database returns a result set, which is just a list of rows containing the answers. The simplest
query is just to return all the rows from a table, but more often, the rows are filtered in some way to return just the answer
wanted. Often, data from multiple tables gets combined into one, by doing a "join". Conceptually, this is done by taking all
possible combinations of rows (the "cross-product"), and then filtering out everything except the answer. In practice, relational
database management systems rewrite ("optimize") queries to perform faster, using a variety of techniques: In the "join" the
primary optimisation is obtained through the use of indexes to prevent the building of the complete cross-product which would
otherwise be necessary.
The flexibility of relational databases allows programmers to write queries that were not anticipated by the database
designers. As a result, relational databases can be used by multiple applications in ways the original designers did not foresee,
which is especially important for databases that might be used for decades. This has made the idea and implementation of
relational databases very popular with businesses.
Implementations and indexing
All of these kinds of database can take advantage of indexing to increase their speed. The most common kind of index is a
sorted list of the contents of some particular table column, with pointers to the row associated with the value. An index allows
a set of table rows matching some criterion to be located quickly. Various methods of indexing are commonly used; b-trees, hashes, and linked lists are all common indexing techniques.
Relational DBMSs have the advantage that indices can be created or dropped without changing existing applications, because
applications don't use the indices directly. Instead, the database software decides on behalf of the application which indices to
use. The database chooses between many different strategies based on which one it estimates will run the fastest.
Relational DBMSs utilise many different algorithms to compute the result of an SQL
statement. The RDBMs will produce a plan of how to execute the query, which is generated by analysing the run times of the
different algorithms and selecting the quickest. Some of the key algorithms that deal with joins are Nested Loops Join, Sort-Merge Join and Hash Join.
Mapping objects into databases
In recent years, the object-oriented paradigm has been applied to
databases as well, creating a new programming model known as object
databases. These databases attempt to overcome some of the difficulties of using objects with the SQL DBMSs. An
object-oriented program allows objects of the same type to have different implementations and behave differently, so long as they
have the same interface (polymorphism). This doesn't fit well with a SQL database where user-defined types are
difficult to define and use, and where the Two Great Blunders prevail: the identification of classes with tables (the correct
identification is of classes with types, and of objects with values), and the usage of pointers.
A variety of ways have been tried for storing objects in a database, but there is little consensus on how this should be done.
Implementing object databases undo the benefits of relational model by introducing pointers and making ad-hoc queries more
difficult. This is because they are essentially adaptations of obsolete network and hierarchical databases to object-oriented
programming. As a result, object databases tend to be used for specialized applications and general-purpose object databases have
not been very popular. Instead, objects are often stored in SQL databases using complicated mapping software. At the same time,
SQL DBMS vendors have added features to allow objects to be stored more conveniently, drifting even further away from the
relational model.
Applications of databases
Databases are used in many applications, spanning virtually the entire range of computer software. Databases are the preferred method of storage for large multiuser applications, where
coordination between many users is needed. Even individual users find them convenient, though, and many electronic mail programs
and personal organizers are based on standard database technology.
Transactions and concurrency
In addition to their data model, most practical databases attempt to enforce a database transaction model that has desirable data integrity properties. Ideally, the database
software should enforce the ACID rules, summarised here:
- Atomicity - Either all the tasks in a transaction must be done, or none of
them. The transaction must be completed, or else it must be undone (rolled back).
- Consistency - Every transaction must preserve the integrity constraints
-- the declared consistency rules -- of the database. It cannot place the data in a contradictory state.
- Isolation - Two simultaneous transactions cannot interfere with one another.
Intermediate results within a transaction are not visible to other transactions.
- Durability - Completed
transactions cannot be aborted later or their results discarded. They must persist through (for instance) restarts of the DBMS
after crashes.
In practice, many DBMS's allow most of these rules to be selectively relaxed for better performance.
Concurrency control is a method used to ensure
transactions are executed in a safe manner and follows the ACID rules. The DBMS must be able to ensure only serializable,
recoverable schedules are allowed, and that no actions of committed transactions are lost while undoing aborted transactions.
The Database State
While databases are a vital tool of modern life, concerns are growing over the privacy and civil liberties issues that result,
particularly with governments using increasingly larger interlinked databases with their potential for mass surveillance in what has been termed by some the 'Database
State'.
References
|