What is database normalization?

Introduction

Database normalization is a crucial technique that aims to eliminate redundancy in relational databases. Doing this will allow your database schema to maintain clean and well-structured tables across the database. It will also improve your data integrity. This beginners’ tutorial will explain what you need to know about this subject to help your database management system (DBMS).

Why is database normalization important?

Very large databases are plagued with having multiple entries and records, with duplicated data. You'll eventually need to arrange entries so that other professionals can work on them. This is why database normalization is important. It will help you find redundant data. We're going to discuss basic concepts of database normalization. These include the types, the use cases, and the importance of normalized databases for database schemas.

Understanding database normalization

Database normalization is a database manipulation technique where existing tables are modified to reduce redundancy in the schema. Normalization means splitting a large table into smaller, more precise tables. All tables will still maintain some kind of relationship between them, with the aim of increasing clarity in the data. Increasing performance is one of the major goals of normalization. Repeated data ultimately will slow down the overall system. With normalization practices, we'll divide data into smaller bits that will increase speed.

To achieve database normalization, the database tables will be further split into several smaller tables that are linked through their relationships.

With the help of primary keys, foreign keys, and composite keys, a database administrator is able to create and maintain relationships between tables. Let's take a deeper dive.

The database normalization process

The database normalization process involves a series of steps that's commonly known as normal forms, with each of the forms building up on the previous one. To begin building on these forms, we have to make sure we already know how to establish relationships between tables using keys.

So for example, the primary key in the table student_data will be related in another that we can call student_results:

CREATE TABLE student_data ( id INT PRIMARY KEY AUTO_INCREMENT, name VARCHAR(255), ); CREATE TABLE student_results ( id INT PRIMARY KEY AUTO_INCREMENT, student_data_id INT, result INT, FOREIGN KEY (student_data_id) REFERENCES student_data (id) );

Primary key: This is a unique value found in a single column, used to identify a table. A primary key value cannot be NULL. Foreign key: A foreign key is a primary key that's found in another different table.

A composite key: A composite key can be used to uniquely identify a row and comprises of more than one column. A primary key can be a composite key.

Database normal forms

Normal forms in database normalization refer to a set of rules that defines all the levels of organization and dependency within a relational database.

Each form represents a particular level of normalization, progressively advancing the level of organization and reducing data redundancy. The more we go down the level of forms, the more data we're ultimately cleaning, playing a crucial role in structuring and organizing data systematically.

Normal forms are also helpful when it comes to dependency identification. Each normal form tackles different types of dependencies within data, allowing for efficient data management. Let's learn more about the data normalization types, called normal forms. First normal form (1NF)

The first normal form is primarily for eliminating repeating groups. So for a table to be considered in 1NF, it must meet these criteria:

Compulsory primary key for table identification. No duplicated rows and columns. A single value must be present for each row in a column, as well as in a single cell. To fully understand the concept, let's make use of a simple example to cement the concept in.

We'll build a database that stores important information about students’ applications to a school, for every year that school accepts new students.

Our table will have student_id, name, state, and class: CREATE TABLE student ( id INT PRIMARY KEY AUTO_INCREMENT, name varchar(64), state varchar(128), class varchar(32), ); student

INSERT INTO student (id, name, state, class) VALUES (‘1’, ‘Bob’, ‘Alabama’, ‘4th grade’), (‘2’, ‘Jane’, ‘Georgia’, ‘7th grade’), (‘3’, ‘Allison’, ‘Washington’, ‘6th grade’); id name state class 1 Bob Alabama 4th grade 2 Jane Georgia 7th grade 3 Allison Washington 6th grade

As you can see from the table, we have a primary key, present which exists in id. This means this table is in its first normal form. Second normal form (2NF)

While 1NF helps to improve data organization by removing duplicate or repeating groups, it may not fully address all present types of redundancy. To further address this, we need to apply the second normal form (2NF), which builds upon 1NF.

For a table to be considered to be in 2NF, it must meet this set of rules: It must be in 1NF already.

All the attributes that exist in the second normal form must depend on the primary key that exists in the first normal form table.

Let's go back to our previous table to demonstrate how we can get the table in 2NF and why.

With the student_id, we can very easily determine name, state, and class because they are the same person, and this information should belong to that single student_id, which is the primary key. For the table to be in 2NF, we'll have to separate the tables. This will now be a student_class table. CREATE TABLE student ( id INT PRIMARY KEY AUTO_INCREMENT, fname VARCHAR(64), lname VARCHAR(64), ); CREATE TABLE class ( id INT PRIMARY KEY AUTO_INCREMENT, name VARCHAR(64) ); class id name 1 4th grade 2 7th grade

CREATE TABLE state ( id INT PRIMARY KEY AUTO_INCREMENT, name VARCHAR(64) ); state id name 1 Georgia 2 Alabama

CREATE TABLE student_class ( student_id INT, class_id INT), FOREIGN KEY (student_id) REFERENCES student(id), FOREIGN KEY (class_id) REFERENCES class(id) ); student_id class_id 1 2 2 1

We can also have one more table and call it student_state: CREATE_TABLE student_state ( student_id INT, state_id INT FOREIGN KEY (student_id) REFERENCES student(id), FOREIGN KEY (state_id) REFERENCES state(id) ); student_id state_id 1 1 2 1

Now we can get all kinds of information using the student id. Let's take a step further. Third normal form (3NF)

To achieve 3NF we need to eliminate transitive dependencies. We already removed redundancy by getting our database through the first and second normal forms.

All the attributes not part of our earlier created candidate's key will be dependent on other attributes that are not part of it as well. They are mostly called non-prime attributes.

Our table must be in the second normal form before we can start making some edits. There also has to be no transitive dependency.

Since a student name is associated to the state, that still leaves room for a separated table named state containing state id and the state name: CREATE TABLE state ( id INT PRIMARY KEY AUTO_INCREMENT, name VARCHAR(64) );

id name 1

Alabama 2 Georgia 3 Washington

Additional forms – Boyce-Codd normal form (BCNF), fourth normal form (4NF), and fifth normal form (5NF)

There are situations where even after having your database in 3NF, there will still be some anomalies that you might need to separate even further.

This is where the additional forms of database normalization come into play:

BCNF: It comes in handy when your table has more than one candidate key. This is usually called the 3.5 normal form.

4NF: To achieve 4NF, we'll decompose the table to completely eliminate multivalued dependencies by creating a separate table for each dependency.

5NF: When you cannot decompose or split the table further, it is said to be in its fifth normal form.

Know that these database designs might not be necessary for every database, and some may only need to be used in very exceptional occasions.
A scenario where using a higher form of database normalization design would be ideal would be when you're dealing with complex database schema that involves many-to-many relationships. A perfect example would be an organization or university that tracks courses and students. Each student can enroll in multiple courses, and each course can have multiple students, receiving multiple different grades. Using higher forms of normalization is ideal to remove repetition and keep data integrity.

Practical tips for database normalization

Database normalization only becomes effective when it's done effectively and meticulously. To achieve this, here are some practical tips to guide you:

It's considered best practice to ensure that you carry out the normalization process gradually. Ensure your database is in the first form, then proceed accordingly. Know what and why you're normalizing. While you're aiming to reduce redundancy, it may result in increased complexity in queries, since you'll have more tables.

Maintaining documentation should not be overlooked. Each of your database schema should be properly documented for a clear point of reference along the line.

Also, before choosing a primary key for your table, ensure that it's unique, and the size is on the smaller side to guarantee efficiency. Do not use long strings as values except it's absolutely necessary. Conclusion

Database normalization is crucial for all database design and may come in handy along the way in your organization database schema.

Understanding when to use database normalization is also very important. There are some peculiar cases where it might not come in handy.