How To Securely Store Passwords In Databases

Over the years, many organisations have been subject to breaches where hackers have stolen the databases of user accounts. In each case though, the reputational and financial damage caused to each organisation varied greatly depending on how passwords were stored.

Introduction

According to the Open Web Application Security Project, the top security risk for a website is an injection attack such as an SQL injection attack where an attacker will steal information from underlying databases by exploiting vulnerabilities in poorly-coded forms - such as login forms and search forms - forcing the website to execute and return data from arbitrary search queries.

On top of the vast trove of information that can be stolen from corporate databases, there will almost always be a table of users recording who can log in to the systems and their access privileges. With access to this information, a hacker is able to log in and imitate any user giving them the potential to cause significant harm on an ongoing basis.

In terms of costs to businesses, research from IBM and Ponemon shows that the cost of a "mega-breach" of 1,000,000 user records or more costs an average of $40,000,000 and there have been numerous cases in recent years, including big tech companies like LinkedIn, Myspace and Twitter.

However, by adopting the mindset of having security at every layer, organisations can prepare against being hacked by making any stolen user information less valuable with the use of cryptographic functions. Choosing the right cryptography though, it not as easy as many companies first thought.

Basic Database Design

The user table for a system will usually hold the following fields, at a minimum:

ID Email Password
1 aaron@gmail.com {value}
2 barbarba@hotmail.com {value}
3 charlie@company.com {value}

The email field will be constrained to only allow unique entries so that only one user can be found for any email/password combination when a user tries to log in. The important part of this database design is the value to store in the password field.

Plaintext Passwords

The most obvious way to save the password is to save it as plaintext:

ID Email Password
1 aaron@gmail.com Aaron123
2 barbarba@company.com Password1
3 charlie@company.com letmein

However, this is the worst way to handle password storage as access to every account in the system is immediately available by anyone with access to the database (including internal staff members - let's not forget they often pose the biggest security risks).

Unfortunately many big companies have handled passwords this way in the past and have had millions of their users' account data stolen:

  • In 2015 the web hosting company 000webhost was hacked and 13 million users' names, email addresses and plaintext passwords were exported.
  • In April 2018, the 3rd largest hack in Finland took place against The New Business Center where 130,000 user accounts were stolen with plaintext.
  • In May 2018, Twitter revealed it had been storing over 330 million user passwords in plaintext in a log file and urged all of their users to change their passwords.

Storing passwords in plaintext is surprisingly common in both large and small organisations and there is no good reason for it in terms of security.

Cryptographic Hash Functions

The most common improvement on plaintext passwords is to hash password before saving them to the database. Hashing is a one-way function, meaning in part that it is not feasible to "reverse" the algorithm to calculate the original password given only the hash value to start with. The most common cryptographic hash functions are md5() and the SHA hashes including sha1() and sha256().

This should be compared to encryption except that encryption is a two-way function where encrypted values can be decrypted back to the original value. However, this is a feature we want to avoid when storing passwords.

Hashing works by taking an input password "Aaron123" and running it through a cryptographic algorithm to return another value - the hash - like "0759F90FAF4AB60CE1CA9FA0C1A8F79F".

Using md5() as the hash function, we would have the following table:

ID Email Password
1 aaron@gmail.com 0759F90FAF4AB60CE1CA9FA0C1A8F79F
2 barbarba@hotmail.com 2AC9CB7DC02B3C0083EB70898E549B63
3 charlie@company.com 0D107D09F5BBE40CADE3DE5C71E9E9B7

It is clear to see in this case, that a hacker could not immediately access any user accounts as the values saved do not represent the real passwords.

For the website or application to continue working, when a user logs in the same hash function has to be applied to the password entered by the user before comparing it to the value in the database.

On first sight, it would appear that the passwords are now stored securely as:

  1. Access to user accounts is not possible with the stolen password values
  2. It is not possible to determine the real password values from the hashed values because of the nature of the hashing function

However, this security is only minimal and can easily be overcome with a few scripts.

Cracking Simple Hash Passwords

Instead of trying to reverse the hashing process, an attacker can simply guess the password billions of times within seconds using an automated script. This process is called cracking and a simple brute-force method would look like the following:

def crack_passwords:
	POTENTIAL_PASSWORDS = read_file_of_potential_passwords()

	for password in POTENTIAL_PASSWORDS:
		hashed_password = md5(password)

		users = database.get_all_users_with_password(hashed_password):

		for user in users:
			print "User " + user.email + " has password " + password

This type of attack to crack passwords is called an offline attack because it happens away from the production servers and any defences the website may already have in place - such as only allowing a finite number of login attempts before locking an account.

The file of potential passwords referenced would usually be a list of common passwords from sources like rockyou or john but could easily be generated on-the-fly going through every possible combination of words and letters in the alphabet - called a dictionary attack.

The efficiency of these brute-force dictionary attacks is the primary reason users of a website are encouraged to set strong passwords. This is because the longer and more complex a password, the longer it will take to guess it.

Assessing the script, the execution time of this script is directly proportional to the number of passwords in the wordlist. So if there were 10 or 10,000,000 stolen records, the time to guess a fixed number of passwords against the entire data set would be the same. The script is very unlikely to guess all of the passwords but the more passwords that are stolen, the more likely the script will find at least a few matches.

Large companies that have been breached that used simple cryptographic hash functions include:

  • LinkedIn: In 2012, data was stolen and subsequently sold on the black market containing 117 million user accounts, including passwords hashed with SHA-1.
  • Myspace: In 2013, a data breach of 427 million passwords, contained 68 million records of user accounts with passwords hashed with SHA-1.

In both of these cases, cracking the vast majority of the passwords in the stolen databases was trivial, measured only in days.

Cryptographic Hash Functions With Salts

Extra security can be applied when saving a password by adding a salt, which is simply a string of text. The best salts are long, unique and random and are commonly generated using UUID() functions to create a Universally Unique Identifier.

When including a salt, the password value saved in the database is calculated with:

hash_function(password + salt)

The hash and the salt are both saved to each user record:

ID Email Password Salt
1 aaron@gmail.com 12F76F5C4FB052A5B97D76CE3DCBFB2C 97ae93ee-d3c6-4f26-8117-a863211be0e2
2 barbarba@hotmail.com 19362C847EB76415F0A9ED42C8531E4D a72d5ab9-6913-4821-a1e0-20c7d58528a1
3 charlie@company.com 0E0E8C023EC792344957C84FA14F0724 e43a9355-52f7-4943-9e2c-eed853c0abf8

An equivalent method is used to authenticate users when they log in by salting and hashing the user-supplied password value before comparing it with the value in the database.

This method improves on cryptographic hash functions without salts by adding more entropy to each password - making them harder to guess.

Cracking Hashed and Salted Passwords

A similar script would be used to try to crack these hashes:

def crack_passwords:
	POTENTIAL_PASSWORDS = read_file_of_potential_passwords()

	for user in database.get_all_users():
		for password in POTENTIAL_PASSWORDS:
			hashed_password = md5(password + user.salt)

			if hashed_password == user.password:
				print "User " + user.email + " has password " + password

However, in this version of the script there is an extra for-loop to iterate over each user in the database as well as each password in the password list. This is because each user has a unique salt to apply before generating a hash to compare against the database value.

Consider two users both with the password "Password1". With a random salt applied to each, the database would save hashed versions of the passwords "Password1abc" and "Password1xyz" instead of simply the hashed version of "Password1". The suffixes "abc" and "xyz" as the salts are known to the hacker as they are stored in the database and would be appended to each word in the word list to guess the passwords. However, the generated hash to compare can only be compared to the user record with that exact salt to get a meaningful result. This makes the execution time of the script directly proportional to both the number of words in the word list and the number of users in the database - a significant increase with a large user database.

In any case though, this improvement is good but simply not good enough as processing power continues to increase in accordance with Moore's Law and provides relatively little advantage to hackers committed to cracking the passwords.

Many large corporations have been breached where stolen passwords were salted before they were hashed. This includes imesh.com - a now defunct peer-to-peer file sharing service - where, in 2013, 51 million accounts were stolen with salted and md5-hashed passwords.

Key Derivation Functions

Improving on cryptographic hash functions, key derivation functions introduce a configurable work factor to increase the time taken to hash a single password. The passwords are still salted but are then hashed and rehashed over and over again thousands of times. This process is called key stretching and makes the overall process significantly slower.

The premise of key derivation functions is that a user will not notice an additional 100 milliseconds of processing time to log in to a service but a hacker will notice this when multiplied by millions of attempts over millions of users.

The work factor itself is logarithmic so a work factor of 10 will make the hashing function run 210 (1024) times in total. LastPass, for example, force the hashing function to run over 100,000 times in an effort to protect passwords from potential cracking attempts.

Popular KBF functions include bcrpyt() and pbkdf() where the following tables uses bcrypt() with a work factor of 12.

ID Email Password
1 aaron@gmail.com $2y$12$WhlgmIYaMNlDp0tiZN4nNeix7CbA5RGl9eXpVTfjQhkOlaVgM.mSe
2 barbarba@hotmail.com $2y$12$dz30gpl4XuYl6es3/wOJk.nTMwz8ox6/ZDkjxRi0KwuWMe6kz5/Hy
3 charlie@company.com $2y$12$0VrmT1wUJsbd76HF3RZOy.IPGGJXqpJbhsZyD0T90AEF6E81Q3k6.

Examining the password for aaron@gmail.com, bcrypt() generated a password in a format known as modular crypt format:

  • 2y - the version/type of the hash function bcrypt()
  • 12 - the work factor
  • WhlgmIYaMNlDp0tiZN4nNeix7CbA5RGl9eXpVTfjQhkOlaVgM.mSe - the salt (the first 22 characters) and the hash (remaining characters)

The work factor of 12 means the hash function was executed 212 (4096) times to create the final hash value.

Primarily because of the work factor, use of a key derivation function is required by institutions such as NIST (US Gov.):

Verifiers SHALL store memorized secrets [i.e. passwords] in a form that is resistant to offline attacks. Memorized secrets SHALL be salted and hashed using a suitable one-way key derivation function. Key derivation functions take a password, a salt, and a cost factor as inputs then generate a password hash. Their purpose is to make each password guessing trial by an attacker who has obtained a password hash file expensive and therefore the cost of a guessing attack high or prohibitive.

However, even though more work and cost is required, it is still possible to crack these passwords.

Cracking KDF Passwords

The script used to crack passwords hashed with a key derivation function is exactly the same as the script seen earlier to crack passwords hashed with a cryptographic hash function and a salt, except that the hash function used would be changed.

However, at an estimated 100 milliseconds per hash with 1 million passwords and 10,000 users, the execution time of the script would be over 30 years.

In order to crack these passwords, only the most common passwords would be attempted from a list of between 100-1000 words, including passwords like "password", "Password1", "123456", "qwerty" and so on.

In order to attempt more password guesses, an attacker would need to setup a purpose-built machine. This is feasible because there is only one downside to this type of hash function as they are only intensive on processors but not on memory or disk space. A dedicated hacker could build a machine with many GPUs (such as an FPHA or an ASIC) and massively-parallelise across the hardware. This would take quite some skill, money and patience but is theoretically possible to a degree.

Of course, the security industry came together to improve the cryptography so this is not feasible.

Memory-Hard Functions

To improve security, the security industry got together and formed a competition to create the best algorithm to store passwords, which ran between 2013-2015. The winner was ultimately Argon2(), which is a memory-hard key derivation function.

Argon2() improves on bcrypt() and other key derivation functions by also introducing parameters for a time cost, a memory cost and a parallelism degree:

  • A time cost, which defines the execution time
  • A memory cost, which defines the memory usage
  • A parallelism degree, which defines the number of threads in use

The full specifications can be found on GitHub in a technical document.

There are three variants of Argon2:

  1. Argon2i: Faster, resistant to GPU cracking attacks and suitable for application such as cryptocurrency.
  2. Argon2d: Slower but provides more security for use in password hashing.
  3. Argon2id: A hybrid of the above providing mixed benefits of the above variants

This is state of the art in terms of password hashing as there is no known way to crack these passwords in any feasibly short time period.

NIST go so far as to recommend its use:

A memory-hard function SHOULD be used because it increases the cost of an attack.

Summary

Organisations should carefully consider how they store passwords in databases as part of their overall security policies to minimise the damage that can be caused by hackers.

By adopting the best security practices at every level, organisations can protect user accounts even in the event of a breach by hashing passwords appropriately - ideally with memory-hard key derivation functions.