What is Data Normalization?

Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.

The Normal Forms

The database community has developed a series of guidelines for ensuring that databases are normalized. These are referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) through five (fifth normal form or 5NF). In practical applications, you'll often see 1NF, 2NF, and 3NF along with the occasional 4NF. Fifth normal form is very rarely seen and won't be discussed in this article.
Before we begin our discussion of the normal forms, it's important to point out that they are guidelines and guidelines only. Occasionally, it becomes necessary to stray from them to meet practical business requirements. However, when variations take place, it's extremely important to evaluate any possible ramifications they could have on your system and account for possible inconsistencies. That said, let's explore the normal forms.

First Normal Form (1NF)

First normal form (1NF) sets the very basic rules for an organized database:

  • Eliminate duplicative columns from the same table.
  • Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).
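
As a minimal sketch of these two rules, the snippet below uses Python's standard sqlite3 module to replace a pair of duplicative columns with a separate table that has its own primary key. The table and column names (customer_flat, phone1, phone2, customer_phone) are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Not in 1NF: phone1 and phone2 are duplicative columns.
    CREATE TABLE customer_flat (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        phone1      TEXT,
        phone2      TEXT
    );
    INSERT INTO customer_flat VALUES (1, 'Ada', '555-0100', '555-0101');
    INSERT INTO customer_flat VALUES (2, 'Bob', '555-0200', NULL);

    -- 1NF: one row per phone number, identified by a composite primary key.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE customer_phone (
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        phone       TEXT NOT NULL,
        PRIMARY KEY (customer_id, phone)
    );

    INSERT INTO customer SELECT customer_id, name FROM customer_flat;
    INSERT INTO customer_phone
        SELECT customer_id, phone1 FROM customer_flat WHERE phone1 IS NOT NULL
        UNION
        SELECT customer_id, phone2 FROM customer_flat WHERE phone2 IS NOT NULL;
""")

# Every phone number is now a row of its own rather than a numbered column.
phones = conn.execute(
    "SELECT customer_id, phone FROM customer_phone ORDER BY customer_id, phone"
).fetchall()
print(phones)
```

With this layout a customer can have any number of phone numbers without another schema change.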

Second Normal Form (2NF)

Second normal form (2NF) further addresses the concept of removing duplicative data:

  • Meet all the requirements of the first normal form.
  • Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
  • Create relationships between these new tables and their predecessors through the use of foreign keys.
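
The following sketch, again using Python's sqlite3 module, shows one way these steps might look in practice: customer details that repeat on every order row are moved into their own table and linked back with a foreign key. The data, the table names, and the simplifying assumption that name plus city identifies a customer are all hypothetical.

```python
import sqlite3

# Flat rows: customer name and city repeat on every order (hypothetical data).
flat_rows = [
    (100, "Ada", "London", "keyboard"),
    (101, "Ada", "London", "monitor"),
    (102, "Bob", "Paris", "mouse"),
]

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        item        TEXT NOT NULL
    );
""")

# Store each distinct customer once and point every order at it.
customer_ids = {}
for order_id, name, city, item in flat_rows:
    if (name, city) not in customer_ids:
        cur = conn.execute(
            "INSERT INTO customer (name, city) VALUES (?, ?)", (name, city)
        )
        customer_ids[(name, city)] = cur.lastrowid
    conn.execute(
        "INSERT INTO orders VALUES (?, ?, ?)",
        (order_id, customer_ids[(name, city)], item),
    )
conn.commit()

# The foreign key rejects an order that refers to a customer that does not exist.
try:
    conn.execute("INSERT INTO orders VALUES (103, 99, 'cable')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
    conn.rollback()

customer_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]

# A join reproduces the original flat view whenever it is needed.
rejoined = conn.execute("""
    SELECT o.order_id, c.name, c.city, o.item
    FROM orders AS o JOIN customer AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
```

The join at the end returns the same rows as the flat input, so nothing is lost by storing each customer only once.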

Third Normal Form (3NF)

Third normal form (3NF) goes one large step further:

  • Meet all the requirements of the second normal form.
  • Remove columns that are not dependent upon the primary key.
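
To make the 3NF rule concrete, here is a small plain-Python sketch. In the hypothetical employee records below, dept_name depends on dept_id rather than on the primary key emp_id, so it is moved into a separate lookup structure.

```python
# Hypothetical flat records: dept_name depends on dept_id, not on emp_id.
employees_flat = [
    {"emp_id": 1, "name": "Ada", "dept_id": 10, "dept_name": "Research"},
    {"emp_id": 2, "name": "Bob", "dept_id": 10, "dept_name": "Research"},
    {"emp_id": 3, "name": "Cy", "dept_id": 20, "dept_name": "Sales"},
]

# 3NF: the department name is stored once per department.
departments = {row["dept_id"]: row["dept_name"] for row in employees_flat}

# The employee table keeps only columns that depend on emp_id.
employees = [
    {"emp_id": r["emp_id"], "name": r["name"], "dept_id": r["dept_id"]}
    for r in employees_flat
]

# Renaming a department is now a single update instead of one per employee.
departments[10] = "R&D"

# The flat view can still be rebuilt on demand.
rejoined = [dict(e, dept_name=departments[e["dept_id"]]) for e in employees]
```

Because the name lives in one place, two employees of the same department can no longer disagree about what it is called.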

Fourth Normal Form (4NF)

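Finally, fourth normal form (4NF) has one additional requirement:

  • Meet all the requirements of the third normal form.
  • Remove multi-valued dependencies: a relation is in 4NF when every non-trivial multi-valued dependency in it is determined by a key.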

Remember, these normalization guidelines are cumulative. For a database to be in 2NF, it must first fulfill all the criteria of a 1NF database.
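
Fourth normal form is usually stated in terms of multi-valued dependencies, that is, independent multi-valued facts stored in the same table. The plain-Python sketch below uses a hypothetical (employee, skill, language) relation in which skills and languages are unrelated to each other. It splits the relation into two projections and checks that joining them gives back exactly the original rows.

```python
# Hypothetical relation: an employee's skills and languages are independent facts,
# so every skill has to be paired with every language.
emp_skill_lang = {
    ("Ada", "SQL", "English"),
    ("Ada", "SQL", "French"),
    ("Ada", "Python", "English"),
    ("Ada", "Python", "French"),
    ("Bob", "SQL", "German"),
}

# Decompose into two relations, each holding one multi-valued fact.
emp_skill = {(emp, skill) for emp, skill, _ in emp_skill_lang}
emp_lang = {(emp, lang) for emp, _, lang in emp_skill_lang}

# A natural join on the employee column rebuilds the original relation.
rejoined = {
    (emp, skill, lang)
    for emp, skill in emp_skill
    for other_emp, lang in emp_lang
    if emp == other_emp
}
```

After the split, recording a new language for Ada takes one new row in emp_lang instead of one row per skill.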


What is a Distributed Database Management System?

A distributed database management system is a software system that permits the management of the distributed database and makes the distribution transparent to the users. A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network. Sometimes the term "distributed database system" is used to refer jointly to the distributed database and the distributed DBMS.
Distributed database management systems can be architected as client-server systems or peer-to-peer ones. In the former, one or more servers manage the database and handle user queries that are passed on by the clients. The clients usually have limited database functionality and normally pass the SQL queries over to the servers for processing. In peer-to-peer systems, each site has equal functionality for processing.

A distributed database is a database that is under the control of a central database management system (DBMS) in which storage devices are not all attached to a common CPU. It may be stored in multiple computers located in the same physical location, or may be dispersed over a network of interconnected computers.
Collections of data (e.g., in a database) can be distributed across multiple physical locations. A distributed database is divided into separate partitions, or fragments. Each fragment may be replicated (i.e., kept as redundant fail-over copies, much like RAID).
Besides replication and fragmentation, there are many other distributed database design technologies, for example local autonomy and synchronous and asynchronous distributed database technologies. How these technologies are implemented depends on the needs of the business and on the sensitivity and confidentiality of the data to be stored, and hence on the price the business is willing to pay to ensure data security, consistency and integrity.

Basic architecture

A database server is the software managing a database, and a client is an application that requests information from a server. Each computer in a system is a node. A node in a distributed database system acts as a client, a server, or both, depending on the situation.
  • Horizontal fragments: subsets of tuples (rows) from a relation (table).
  • Vertical fragments: subsets of attributes (columns) from a relation (table).
  • Mixed fragment: a fragment that is both horizontally and vertically fragmented.
  • Homogeneous distributed database: uses one DBMS (e.g., Oracle).
  • Heterogeneous distributed database: uses multiple DBMSs (e.g., Oracle, MS-SQL, and PostgreSQL).

Users access the distributed database through:
  • Local applications: applications that do not require data from other sites.
  • Global applications: applications that do require data from other sites.
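
The fragment types above can be sketched in a few lines of plain Python. The employee rows, the helper names horizontal_fragment and vertical_fragment, and the choice of emp_id as the key kept in every vertical fragment are assumptions made for illustration.

```python
# A hypothetical relation, one dict per tuple (row).
employees = [
    {"emp_id": 1, "name": "Ada", "site": "London", "salary": 5000},
    {"emp_id": 2, "name": "Bob", "site": "Paris", "salary": 4200},
    {"emp_id": 3, "name": "Cy", "site": "London", "salary": 3900},
]

def horizontal_fragment(rows, predicate):
    """Keep whole rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

def vertical_fragment(rows, columns, key="emp_id"):
    """Keep only the named columns, plus the key needed to rejoin later."""
    keep = [key] + [c for c in columns if c != key]
    return [{c: row[c] for c in keep} for row in rows]

# Horizontal fragments: each site stores the rows for its own employees.
london = horizontal_fragment(employees, lambda r: r["site"] == "London")
paris = horizontal_fragment(employees, lambda r: r["site"] == "Paris")

# Vertical fragments: general details and payroll data are stored separately.
public = vertical_fragment(employees, ["name", "site"])
payroll = vertical_fragment(employees, ["salary"])

# Mixed fragment: payroll columns for the London rows only.
london_payroll = vertical_fragment(london, ["salary"])

# Horizontal fragments are recombined by union, vertical ones by joining on the key.
from_horizontal = sorted(london + paris, key=lambda r: r["emp_id"])
payroll_by_id = {r["emp_id"]: r for r in payroll}
from_vertical = [{**p, **payroll_by_id[p["emp_id"]]} for p in public]
```

An application that reads only the london fragment is a local application in the sense used above; one that needs from_horizontal is a global application.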


Important considerations

Care with a distributed database must be taken to ensure that:
  • The distribution is transparent: users must be able to interact with the system as if it were one logical system. This applies to the system's performance and methods of access, among other things.
  • Transactions are transparent: each transaction must maintain database integrity across multiple databases. Transactions must also be divided into subtransactions, each subtransaction affecting one database system.
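
The idea of splitting a transaction into subtransactions can be sketched as follows. Two in-memory SQLite databases stand in for two remote sites, and the helper names (make_site, run_global_transaction, transfer) are hypothetical. This is a simplified all-or-nothing sketch, not a full two-phase commit protocol: a real system also has to cope with a site failing after another site has already committed.

```python
import sqlite3

def make_site(balance):
    """Create a stand-in 'site' holding one account that may not go negative."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO account VALUES (1, ?)", (balance,))
    conn.commit()
    return conn

def run_global_transaction(subtransactions):
    """Run (connection, sql, params) subtransactions; commit all or roll back all."""
    touched = []
    try:
        for conn, sql, params in subtransactions:
            touched.append(conn)
            conn.execute(sql, params)
    except sqlite3.Error:
        for conn in touched:
            conn.rollback()
        return False
    for conn in touched:
        conn.commit()
    return True

def balance(conn):
    return conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]

site_a, site_b = make_site(100), make_site(0)

def transfer(amount):
    """One global transaction made of two subtransactions, one per site."""
    return [
        (site_b, "UPDATE account SET balance = balance + ? WHERE id = 1", (amount,)),
        (site_a, "UPDATE account SET balance = balance - ? WHERE id = 1", (amount,)),
    ]

first = run_global_transaction(transfer(70))   # both subtransactions succeed
second = run_global_transaction(transfer(70))  # the debit fails, so the credit is undone
```

In the second transfer the credit at site_b is applied first and then rolled back when the debit at site_a violates the balance check, so neither site ends up changed.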

Advantages of distributed databases

  • Reflects organizational structure: database fragments are located in the departments they relate to.
  • Local autonomy: a department can control its own data, since its staff are the ones most familiar with it.
  • Improved availability: a fault in one database system will only affect one fragment, instead of the entire database.
  • Improved performance: data is located near the site of greatest demand, and the database systems themselves are parallelized, allowing load on the databases to be balanced among servers. (A high load on one module of the database won't affect other modules of the database in a distributed database.)
  • Economics: it costs less to create a network of smaller computers with the power of a single large computer.
  • Modularity: systems can be modified, added and removed from the distributed database without affecting other modules (systems).

Disadvantages of distributed databases

  • Complexity: extra work must be done by the DBAs to ensure that the distributed nature of the system is transparent. Extra work must also be done to maintain multiple disparate systems, instead of one big one. Extra database design work must also be done to account for the disconnected nature of the database. For example, joins become prohibitively expensive when performed across multiple systems.
  • Economics: increased complexity and a more extensive infrastructure mean extra labour costs.
  • Security: remote database fragments must be secured, and because they are not centralized, the remote sites must be secured as well. The infrastructure must also be secured (e.g., by encrypting the network links between remote sites).
  • Difficult to maintain integrity: in a distributed database, enforcing integrity over a network may require too many network resources to be feasible.
  • Inexperience: distributed databases are difficult to work with, and because the field is young there is not much readily available experience on proper practice.

DAY 4 : Directory Service

What is a Directory Service?

A directory service is a software application — or a set of applications — that stores and organizes information about a computer network's users and network shares, and that allows network administrators to manage users' access to the shares. Additionally, directory services act as an abstraction layer between users and shared resources.
A directory service should not be confused with the directory itself, which is the database that holds the information about objects that are to be managed by the directory service. The directory service is the interface to the directory and provides access to the data that is contained in that directory. It acts as a central authority that can securely authenticate resources and manage identities and relationships between them.
A directory service is highly optimized for reads and provides advanced search on the many different attributes that can be associated with objects in a directory. The data that is stored in the directory is defined by an extendible and modifiable schema. Directory services use a distributed model for storing their information and that information is usually replicated between directory servers. [1]
A directory service maps the names of network resources to their respective network addresses. The user doesn't have to remember the physical address of a network resource; providing a name is enough to locate it. Each resource on the network is considered an object on the directory server. Information about a particular resource is stored as attributes of that object. Information within objects can be made secure so that only users with the appropriate permissions are able to access it.
A directory service defines the namespace for the network. A namespace is a set of rules that determine how network resources are named and identified. The rules specify that the names be unique and unambiguous. In LDAP the name is called the distinguished name (DN) and is used to refer to a collection of attributes which make up a directory entry.
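
The name-to-address mapping and the uniqueness rule described above can be illustrated with a toy in-memory directory. This is a sketch under loud assumptions: it does not talk to a real LDAP server, the Directory class and its methods are hypothetical, and distinguished names are compared by simple lowercasing rather than by LDAP's actual matching rules.

```python
class Directory:
    """A toy directory: each entry is a DN mapped to multi-valued attributes."""

    def __init__(self):
        self._entries = {}

    def add(self, dn, attributes):
        key = dn.lower()  # simplification: real DN matching rules are more involved
        if key in self._entries:
            raise ValueError(f"duplicate DN: {dn}")  # names must be unique
        self._entries[key] = attributes

    def lookup(self, dn):
        """Return the attributes stored under one distinguished name."""
        return self._entries[dn.lower()]

    def search(self, base_dn, attribute, value):
        """Return the DNs at or below base_dn whose attribute contains value."""
        base = base_dn.lower()
        return sorted(
            dn
            for dn, attrs in self._entries.items()
            if (dn == base or dn.endswith("," + base))
            and value in attrs.get(attribute, [])
        )

directory = Directory()
directory.add(
    "cn=printer1,ou=devices,dc=example,dc=com",
    {"cn": ["printer1"], "ipHostNumber": ["192.0.2.10"]},
)
directory.add(
    "cn=fileserver,ou=devices,dc=example,dc=com",
    {"cn": ["fileserver"], "ipHostNumber": ["192.0.2.20"]},
)
directory.add(
    "uid=ada,ou=people,dc=example,dc=com",
    {"uid": ["ada"], "mail": ["ada@example.com"]},
)

# The user supplies a name; the directory supplies the address.
address = directory.lookup("CN=printer1,ou=devices,dc=example,dc=com")["ipHostNumber"][0]
devices = directory.search("ou=devices,dc=example,dc=com", "cn", "fileserver")
```

Adding a second entry with an existing DN raises ValueError, which is the toy equivalent of the namespace rule that names be unique and unambiguous.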
A directory service is a tool for locating, managing, administering, and organizing network resources, which can include volumes, folders, files, printers, users, groups, and other objects. It is a central component of a network operating system (NOS) and functions much like a phone book. For example, if you look up “computers,” you will find a list of available computers and information for accessing them. However, for a directory service to function like a network phone book, it must be combined with a matching entity, the directory database. Examples of directory services produced by different vendors and standards bodies include the following:
  • Windows NT Directory Services (NTDS) for Windows NT
  • Active Directory for Windows 2000
  • Novell Directory Services (NDS) for Novell NetWare version 4.x



Comparison with relational databases

A number of things distinguish a directory service from a relational database:
  • In a directory, the information is generally read more often than it is written. Hence the usual database features of transactions and rollback are not implemented in a directory. Data may be made redundant, but the objective is to get a faster response time during searches.
  • Data is organized in a strictly hierarchical manner. In terms of cardinality, directories do not have many-to-many relations. Instead, such relations must be maintained explicitly using lists of distinguished names.
  • Attributes are defined as stand-alone schema entities. That is: they exist without being part of an object and have a fixed type.
  • An objectClass has:
    • Must-attributes that each of its instances must have
    • May-attributes that can be defined for an instance, but could also be omitted when the object is created. The lack of a certain attribute is somewhat like a NULL in relational databases
  • Attributes are often multi-valued in directories whereas this is seen in relational database design as a denormalized state of the database schema.
  • Attributes and objectClasses are standardized throughout the industry and formally registered with the IANA for their object ID. Therefore directory applications seek to reuse much of the standard classes and attributes to maximize the benefit of existing directory server software.
  • Object instances are slotted into namespaces. That is, each objectClass inherits from its parent objectClass (and ultimately from the root of the hierarchy) adding attributes to the must/may list.
  • Directory services are often a central component in the security design of an IT system and have a correspondingly fine granularity regarding access control: who may operate in which manner on what information. Also see: ACLs
Directory design, compared to relational database design, therefore feels much more like customizing an industry standard than conceptualizing a real-life situation. Different rules apply to data modeling, and the access scenarios are different.
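
The schema ideas in the list above (must-attributes, may-attributes, inheritance from a parent objectClass, and multi-valued attributes) can be sketched in plain Python. The ObjectClass type and the validate function are hypothetical helpers, and the two classes below are simplified illustrations rather than complete standard schema definitions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ObjectClass:
    name: str
    must: frozenset = frozenset()
    may: frozenset = frozenset()
    parent: Optional["ObjectClass"] = None

    def all_must(self):
        """Must-attributes of this class plus everything inherited."""
        inherited = self.parent.all_must() if self.parent else frozenset()
        return inherited | self.must

    def all_may(self):
        """May-attributes of this class plus everything inherited."""
        inherited = self.parent.all_may() if self.parent else frozenset()
        return inherited | self.may

def validate(entry, object_class):
    """Return a sorted list of problems; an empty list means the entry is valid."""
    present = set(entry)
    missing = object_class.all_must() - present
    unknown = present - object_class.all_must() - object_class.all_may()
    return sorted(
        [f"missing {a}" for a in missing] + [f"unknown {a}" for a in unknown]
    )

# Simplified classes: person inherits the objectClass requirement from top.
top = ObjectClass("top", must=frozenset({"objectClass"}))
person = ObjectClass(
    "person",
    must=frozenset({"cn", "sn"}),
    may=frozenset({"telephoneNumber", "description"}),
    parent=top,
)

# Attributes are multi-valued, so every value is a list.
good_entry = {
    "objectClass": ["top", "person"],
    "cn": ["Ada Lovelace"],
    "sn": ["Lovelace"],
    "telephoneNumber": ["555-0100", "555-0101"],
}
bad_entry = {"objectClass": ["person"], "cn": ["Bob"], "shoeSize": ["44"]}

good_problems = validate(good_entry, person)
bad_problems = validate(bad_entry, person)
```

Here good_entry omits the optional description attribute, which is the directory counterpart of leaving a nullable column empty, while bad_entry fails on both a missing must-attribute and an attribute the schema does not know.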

Implementations of Directory Services

Directory services were part of an Open Systems Interconnection (OSI) initiative to get everyone in the industry to agree to common network standards and so provide multi-vendor interoperability. In the 1980s, the ITU and ISO came up with a set of standards, X.500, for directory services, initially to support the requirements of inter-carrier electronic messaging and network name lookup. The Lightweight Directory Access Protocol (LDAP) is based on the services of X.500 but uses the TCP/IP stack, giving it more relevance on the Internet.
There have been numerous forms of directory service implementations from different vendors. Among them are:
  • NIS: The Network Information Service (NIS) protocol, originally named Yellow Pages (YP), was Sun Microsystems' implementation of a directory service for Unix network environments. (In the early 2000s, Sun absorbed iPlanet, its alliance with Netscape, and developed its LDAP-based directory service as part of Sun ONE, now called Sun Java Enterprise.)
  • eDirectory: This is Novell's implementation of directory services. It supports multiple architectures including Windows, Netware, Linux and several flavours of Unix and has long been used for user administration, configuration management, and software management. eDirectory has evolved into a central component in a broader range of Identity management products. It was previously known as Novell Directory Services.
  • Red Hat Directory Server: Red Hat released the directory service that it acquired from Netscape Security Solutions as a commercial product running on top of Red Hat Enterprise Linux called Red Hat Directory Server and as part of Fedora Core called Fedora Directory Server.
  • Active Directory: Microsoft's directory service is the Active Directory which is included in the Windows 2000 and Windows Server 2003 operating system versions.
  • Open Directory: Apple's Mac OS X Server offers a directory service called Open Directory which integrates with many open standard protocols such as LDAP and Kerberos as well as proprietary directory solutions like Active Directory and eDirectory.
  • Apache Directory Server: Apache Software Foundation offers a directory service called ApacheDS.
  • Oracle Internet Directory (OID): Oracle's directory service, which is compatible with LDAP version 3.

There are also plenty of open-source tools for creating directory services, including OpenLDAP, the Kerberos protocol, and Samba, which can act as a domain controller with Kerberos and LDAP back ends.

DAY 4 : What is Data Warehouse

What is a Data Warehouse?

"A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources."
Another definition of a data warehouse is: "A data warehouse is a logical collection of information gathered from many different operational databases, used to create business intelligence that supports business analysis activities and decision-making tasks; primarily, a record of an enterprise's past transactional and operational information, stored in a database designed to favour efficient data analysis and reporting (especially OLAP)."
Generally, data warehousing is not meant for current "live" data, although 'virtual' or 'point-to-point' data warehouses can access operational data. A 'real' data warehouse is generally preferred to a virtual one because the stored data has been validated and is set up to provide reliable results to the common types of queries used in a business.

History of data warehousing

In the 1990s, as organizations of scale began to need more timely data about their business, they found that traditional information systems technology was simply too cumbersome to provide relevant data efficiently and quickly. Completing reporting requests could take days or weeks using antiquated reporting tools that were designed more or less to 'execute' the business rather than 'run' the business.
From this idea, the data warehouse was born as a place where relevant data could be held for completing strategic reports for management. The key here is the word 'strategic' as most executives were less concerned with the day to day operations than they were with a more overall look at the model and business functions.
As with all technology, over the course of the latter half of the 20th century, we saw increased numbers and types of databases. Many large businesses found themselves with data scattered across multiple platforms and variations of technology, making it almost impossible for any one individual to use data from multiple sources. A key idea within data warehousing is to take data from multiple platforms and technologies (as varied as spreadsheets, DB2 databases, IDMS records, and VSAM files) and place it in a common location that uses a common querying tool. In this way operational databases could be held on whatever system was most efficient for the operational business, while the reporting and strategic information could be held in a common location using a common language. Data warehouses take this a step further by giving the data itself commonality, defining what each term means and keeping it standard. (An example of this would be gender, which can be recorded in many ways but should be standardized in a data warehouse with one common way of referring to each sex.)
All of this was designed to make decision support more readily available without affecting day-to-day operations. One aspect of a data warehouse that should be stressed is that it is NOT a location for ALL of a business's data, but rather a location for data that is 'interesting'. Data that is interesting will assist decision makers in making strategic decisions relative to the organization's overall mission.

Design of data warehouses

Data warehouses often hold large amounts of information, which are sometimes subdivided into smaller logical units called dependent data marts. Dependent data marts allow for easier reporting by keeping relevant data together in one location.
Usually, two basic ideas guide the creation of a data warehouse:
  • Integration of data from distributed and differently structured databases, which facilitates a global overview and comprehensive analysis in the data warehouse.
  • Separation of data used in daily operations from data used in the data warehouse for purposes of reporting, decision support, analysis and controlling.
Since OLTP databases contain large volumes of data, it is critical to unload data quickly without adding significant overhead to the production database. Periodically, data is imported from enterprise resource planning (ERP) systems and other related business software systems into the data warehouse for further processing. It is common practice to "stage" data prior to merging it into a data warehouse. In this sense, to "stage data" means to queue it for preprocessing, usually with an ETL tool. The preprocessing program reads the staged data (often from a business's primary OLTP databases), performs qualitative preprocessing or filtering (including denormalization, if deemed necessary), and writes it into the warehouse.
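
The staging flow described above can be sketched as three small Python functions. Everything here is an assumption made for illustration: two in-memory SQLite databases stand in for the OLTP source and the warehouse, the table names customer and dim_customer are hypothetical, and the policy of rejecting rows with an unrecognized gender value is just one possible filtering rule.

```python
import sqlite3

# One standard code per value, however the source system spelled it.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def extract(source):
    """Read the rows to be staged out of the operational database."""
    return source.execute(
        "SELECT id, name, gender FROM customer ORDER BY id"
    ).fetchall()

def transform(staged_rows):
    """Standardize staged rows; return (clean rows, ids of rejected rows)."""
    clean, rejected_ids = [], []
    for customer_id, name, gender in staged_rows:
        code = GENDER_CODES.get((gender or "").strip().lower())
        if code is None:
            rejected_ids.append(customer_id)
        else:
            clean.append((customer_id, name.strip(), code))
    return clean, rejected_ids

def load(warehouse, rows):
    """Write the preprocessed rows into the warehouse."""
    warehouse.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
    warehouse.commit()

oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, gender TEXT)")
oltp.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, " Ada ", "Female"), (2, "Bob", "m"), (3, "Cy", "unknown")],
)
oltp.commit()

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, gender TEXT)"
)

staged = extract(oltp)
clean, rejected_ids = transform(staged)
load(warehouse, clean)

loaded = warehouse.execute(
    "SELECT customer_id, name, gender FROM dim_customer ORDER BY customer_id"
).fetchall()
```

Keeping extract separate from transform means the operational database is only read once, quickly, and all cleaning work happens on the staged copy.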

Dimensions and measures

A data warehouse is created by analyzing ways to categorize data using dimensions and ways to summarize data using measures. Dimensions can be used to filter and navigate summarised data by excluding results or by displaying data in different reporting styles (cross-tabbing). Measures are performance metrics that a business is interested in tracking; these are mainly sums and averages of figures collected by OLTP systems. There seems to be some misunderstanding as to how data warehouses should be designed, since in most cases technical individuals do not fully understand the broader scope of their organisation's business.
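
As a small illustration of dimensions and measures, the sketch below uses Python's sqlite3 module on a hypothetical sales table. Here region and product play the role of dimensions, and the sum and average of amount are the measures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('North', 'Widget', 100.0),
        ('North', 'Gadget', 50.0),
        ('South', 'Widget', 70.0),
        ('South', 'Widget', 30.0);
""")

# Measures (sum and average of amount) summarised along the region dimension.
by_region = conn.execute("""
    SELECT region, SUM(amount), AVG(amount)
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()

# The product dimension used as a filter on the same measure.
widgets_by_region = conn.execute("""
    SELECT region, SUM(amount)
    FROM sales
    WHERE product = ?
    GROUP BY region
    ORDER BY region
""", ("Widget",)).fetchall()
```

Grouping by a different dimension, or by both at once, gives another view of the same measures without changing the stored data.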

Building blocks or Components

  1. Source Data
  2. Data Staging
  3. Data Storage
  4. Information Delivery
  5. Metadata
  6. Management and Control


Business Intelligence reports (e.g., MIS reports) may then be generated from the data managed by the warehouse. In this way the data warehouse supplies the data for and supports the business intelligence tools that an organization might use.
