Encoding: A Brief History and its Role in Cybersecurity

When it comes to using the Internet, chances are you’ve encountered different types of character encoding. These include standards like ASCII, ANSI, Latin-1, ISO 8859-1, Unicode, UTF-7, UTF-8, UCS-2, and URL-encoding. With so many options, it can be hard to keep track of the differences between each standard and know when to use one over the other. That’s where this blog comes in. Here, we’ll delve into the intricacies of character encoding, build a better understanding of each of these standards, and explain how encoding is used in the field of cybersecurity.

What is Encoding? 

Encoding is the process of converting data into a specific format that can be easily transmitted or stored. In the context of character encoding, this means converting written language into a digital format that computers can understand and process. This allows us to easily access and share text-based information online. 

The original purpose of ASCII (American Standard Code for Information Interchange) was to provide a standard way of representing and transmitting text-based information between computers and other devices. It was created in the 1960s and defines only 128 characters, each of which fits in 7 bits: the letters of the English alphabet, digits, punctuation marks, and control characters. This allowed for the efficient transmission of basic text-based information between computers but offered no support for the characters used in other languages.
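As a small illustration of that 128-character limit, the Python sketch below encodes plain English text as ASCII and then tries the same with an accented word. The sample strings are made up for this example.

```python
# ASCII assigns each character a number from 0 to 127, so every code fits in 7 bits.
text = "Hello, World!"
ascii_bytes = text.encode("ascii")
print(list(ascii_bytes[:5]))       # numeric codes of the first five characters
print(max(ascii_bytes) < 128)      # True: nothing needs the eighth bit

# A character outside the 128-entry table cannot be encoded as ASCII.
try:
    "café".encode("ascii")
    accented_ok = True
except UnicodeEncodeError:
    accented_ok = False
print(accented_ok)
```

The accented letter is rejected because ASCII has no entry for it, which is the gap that later standards set out to fill.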

When computers began using 8-bit bytes, various governments and groups started thinking about ways to utilize the remaining 128 places (128-255) in the byte. However, they couldn’t agree on how to use these additional characters, leading to confusion and chaos. This made it difficult for documents to be read when sent from one system to another, as they were using different character standards. 
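The effect is easy to reproduce. In this sketch, a word is written out under one 8-bit convention and read back under another. The two encodings used here (named cp437 and cp1252 in Python) are picked purely for illustration; any pair that disagrees about the upper 128 values would behave similarly.

```python
original = "café"

# The sender's system stores the text using one 8-bit convention...
stored = original.encode("cp437")

# ...and the receiver's system interprets the same bytes using another.
received = stored.decode("cp1252")

print(original, "->", received)
print(received == original)            # False: the accented letter was misread
print(received[:3] == original[:3])    # True: the plain ASCII letters survive
```

Nothing was lost in transit. The bytes are identical on both sides, and only the interpretation of the upper range differs.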

To address this issue, the IBM (OEM) and ANSI (Microsoft) systems defined code pages, each consisting of ASCII for the bottom 128 characters (0-127) and a given language variation for the top 128 characters (128-255). This allowed for the creation of code pages like 437 (the original IBM PC), 737 (Greek), 862 (Hebrew), and more, each with 256 characters. This ensured that anyone using the same code page could exchange documents without having their content distorted.
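To make the split between the shared lower half and the language-specific upper half concrete, the following sketch decodes the same two bytes under three of the code pages named above, using the codec names Python gives them (cp437, cp737, cp862). The byte values are arbitrary choices for the example.

```python
raw = bytes([0x41, 0x80])   # 0x41 sits in the ASCII half, 0x80 in the upper half

for code_page in ("cp437", "cp737", "cp862"):
    # The first character is always "A"; the second depends on the code page.
    print(code_page, raw.decode(code_page))

# The lower 128 values decode identically everywhere.
ascii_half = bytes(range(128))
same_lower_half = (
    ascii_half.decode("cp437")
    == ascii_half.decode("cp737")
    == ascii_half.decode("cp862")
    == ascii_half.decode("ascii")
)
print(same_lower_half)
```

Two parties who agree on the code page read the same character for byte 0x80. Two parties who do not agree each see a different one.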

Introducing Unicode 

Once the internet connected everyone and it became clear that a single 8-bit byte per character was not sufficient for exchanging text on a global scale, there was a fundamental shift in how encoding was accomplished. This led to the development of Unicode, which provided a standard way of representing and encoding text-based information for any language. Unicode assigns a unique number to each character, so a document no longer depends on the reader guessing the right code page. This addressed the issue of different systems using different character standards and ensured that documents could be read and understood across different languages and systems.
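Python exposes these numbers directly, which makes for a quick demonstration. The helper name `describe` below is hypothetical and exists only for this sketch; `ord`, `chr`, and `unicodedata.name` are standard library functions.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format a character's Unicode number and official name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in ("A", "é", "€", "😀"):
    number = ord(ch)             # character -> number
    assert chr(number) == ch     # number -> character
    print(ch, describe(ch))
```

The same number identifies the same character on every platform, whether it is a Latin letter, a currency sign, or an emoji.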

Unicode and UCS 

Unicode and UCS (Universal Character Set) are closely related standards for representing text in digital form. The UCS is defined by the international standard ISO/IEC 10646, while Unicode is maintained by the Unicode Consortium. The two are kept synchronized, so both assign the same unique number (called a code point) to every character, regardless of the platform, program, or language. They were designed to support the interchange, processing, and display of the written texts of the world’s diverse languages.

The UCS is, at its core, a character set standard. It defines the repertoire of characters, their names, and their code points. The UCS includes characters from most of the world’s written languages, as well as symbols and other characters used in various contexts.

Unicode covers the same repertoire and code points and builds on them. It adds character properties, such as each character’s category, along with rules for processing text consistently, for example when normalizing or sorting it. This allows computers to store and manipulate text in a consistent and standardized way, and it makes it possible to display and process text in multiple languages and scripts.

In practice, the main difference between the two is scope: the UCS specifies which characters exist and what their numbers are, while Unicode also specifies how software should treat them. In both cases a code point is only a number. Encoding forms such as UTF-8, UTF-16, and UCS-2 define how that number is stored as bytes.
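One way to see the gap between assigning a number to a character and turning that number into bytes is to serialize a single character several ways. This is a minimal sketch; the euro sign and the three encodings are chosen only as examples.

```python
ch = "€"                  # one character with one code point
print(hex(ord(ch)))       # the code point as a number

# The same code point written out as bytes in three different encodings.
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    data = ch.encode(encoding)
    print(encoding, data.hex(" "), f"({len(data)} bytes)")
```

The code point never changes, but the bytes on disk or on the wire depend entirely on which encoding was chosen.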

One important thing to note is that Unicode and UCS are not limited to representing written languages. They can also encode other symbols, such as mathematical and technical symbols, emoji, and other special characters. This makes them versatile and widely used standards for representing and encoding text in digital form. 

UTF-7 and ISO 8859 

UTF-7 (Unicode Transformation Format – 7-bit) is a character encoding that represents Unicode text using only 7-bit ASCII characters. It was designed for channels that are not guaranteed to pass 8-bit data intact, such as older email systems, rather than for channels with limited bandwidth. Most ASCII characters are sent as they are, while other characters are written as a block of modified Base64 introduced by a plus sign. This allows Unicode characters to travel in email messages and other text-based communications without being corrupted along the way.
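Python ships a utf-7 codec, so the behavior can be observed directly. The sample sentence below is invented for this sketch.

```python
text = "Hi Mom, that costs 5 €"

encoded = text.encode("utf-7")
print(encoded)   # the ASCII text is readable; the euro sign becomes a "+..." block

# Every byte stays in the 7-bit range, so a 7-bit channel cannot damage it.
print(all(byte < 0x80 for byte in encoded))

# Decoding restores the original text exactly.
print(encoded.decode("utf-7") == text)
```

The encoded form is longer than the UTF-8 equivalent would be for this text, which is part of why UTF-7 is rarely chosen when the channel can carry 8-bit data.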

ISO 8859 is a family of 8-bit character encoding standards, the best known of which is ISO 8859-1 (also known as Latin-1). Each part of the family uses a single byte per character, giving up to 256 values: the lower half matches ASCII, and the upper half adds characters for a particular group of languages. ISO 8859-1 covers most Western European languages, while other parts cover scripts such as Cyrillic, Greek, and Hebrew. This allows for the efficient representation and transmission of text-based information in these languages.
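The one-byte-per-character design and its limits can be shown in a few lines. The function name `fits_latin1` is hypothetical and used only for this sketch; "iso-8859-1" and "latin-1" are codec names Python accepts for the same encoding.

```python
# One byte per character: the accented letter is stored as a single byte.
print("café".encode("iso-8859-1"))

# "latin-1" is another name for the same encoding.
print("café".encode("latin-1") == "café".encode("iso-8859-1"))


def fits_latin1(text: str) -> bool:
    """Return True if every character in text exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_latin1("naïve"))    # Western European text fits
print(fits_latin1("Ωμέγα"))    # Greek letters are not in this table
```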

In comparison, UTF-7 is an encoding of Unicode rather than a character table of its own, which means it can represent characters from any language, not just the limited set that fits in one ISO 8859 table. Unicode includes over 100,000 characters, far more than the 256 values available in a single byte. The two also solve different problems: ISO 8859 packs a small repertoire into 8-bit bytes, while UTF-7 carries the full Unicode repertoire over channels that only handle 7-bit data, such as older email systems.

BMP 

The Basic Multilingual Plane (BMP) is the first plane of the Unicode code space and holds the most commonly used characters from the world’s languages. It covers the first 65,536 code points (U+0000 to U+FFFF), which include characters from the Latin, Greek, and Cyrillic alphabets, the most commonly used Chinese, Japanese, and Korean characters, and a wide range of symbols and punctuation marks.

The practical significance of the BMP is that every one of its code points fits in 16 bits. That is the range UCS-2 can represent, using exactly two bytes per character. Characters outside the BMP, such as most emoji, do not fit in a single 16-bit unit: UTF-16 represents them with a pair of 16-bit units called a surrogate pair, and UTF-8 uses four bytes. Software that assumes every character is one 16-bit unit will therefore handle BMP text correctly but can mishandle anything beyond it.
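A short check makes the BMP boundary visible. The helper name `in_bmp` is hypothetical, and the four sample characters are arbitrary picks, three inside the BMP and one outside it.

```python
def in_bmp(ch: str) -> bool:
    """Return True if the character's code point lies in U+0000..U+FFFF."""
    return ord(ch) <= 0xFFFF


for ch in ("A", "Ω", "Я", "😀"):
    utf16 = ch.encode("utf-16-be")
    print(ch, f"U+{ord(ch):04X}", in_bmp(ch), f"{len(utf16)} bytes in UTF-16")
```

Characters inside the BMP take two bytes in UTF-16, while the emoji takes four because it lies above U+FFFF.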

Encoding and Cybersecurity 

Encoding can be both a blessing and a curse in the field of cybersecurity. While it helps protect data and ensure proper communication, it can also be used by malicious actors to conceal their activities. Therefore, understanding encoding and being able to work with different encoding schemes is a crucial skill in cybersecurity. 

Threat actors use various forms of encoding to hide malicious activity. This could involve encoding a payload in Base64, hexadecimal, or another encoding scheme so that signature-based security tools do not recognize it. Many malware payloads are obfuscated using encoding techniques to make it harder for security professionals to reverse engineer them. These threats might even use multiple layers of encoding, requiring several decoding steps to reveal the true payload.
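The layered case can be sketched with a small decoder that keeps peeling until nothing more decodes. Everything here is illustrative: the function names `try_decode` and `peel_layers` are hypothetical, the sample command is a harmless stand-in for a payload, and the approach is a naive heuristic rather than a description of how any particular security tool works. Short ordinary words can happen to be valid Base64, so a real analysis would need additional checks before trusting each step.

```python
import base64
import binascii


def try_decode(data: bytes):
    """Return (scheme, decoded_bytes) if data parses as hex or Base64, else None."""
    stripped = data.strip()
    if not stripped:
        return None
    # Try hex first because its alphabet is the stricter of the two.
    try:
        return "hex", bytes.fromhex(stripped.decode("ascii"))
    except ValueError:  # also covers UnicodeDecodeError for non-ASCII input
        pass
    try:
        return "base64", base64.b64decode(stripped, validate=True)
    except (binascii.Error, ValueError):
        return None


def peel_layers(data: bytes, max_layers: int = 5):
    """Strip encoding layers until nothing more decodes or max_layers is reached."""
    layers = []
    for _ in range(max_layers):
        result = try_decode(data)
        if result is None:
            break
        scheme, data = result
        layers.append(scheme)
    return layers, data


# Build a two-layer sample: text -> hex -> Base64.
command = b"whoami /all"
wrapped = base64.b64encode(command.hex().encode("ascii"))

layers, recovered = peel_layers(wrapped)
print(wrapped)
print(layers, recovered)
```

The `max_layers` cap is a design choice to keep the loop bounded if an input keeps decoding into something that still looks encoded.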

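However, the same family of techniques can be used by security teams to protect an organization’s data. For instance, encryption, which is often loosely described as a form of encoding, is a fundamental tool in data security. The key distinction is that a plain encoding such as Base64 can be reversed by anyone, while encrypted data can only be recovered with the right key. By encrypting sensitive data, organizations can ensure that even if the data falls into the wrong hands, it can’t be easily understood or misused.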

Conclusion 

To summarize, character encoding is the process of converting written language into a digital format that computers can understand and process. There are various character encoding standards, including ASCII, ANSI, Latin-1, ISO 8859-1, Unicode, UTF-7, UTF-8, and UCS-2. Unicode is a widely used standard that provides a unique number for each character, allowing for the efficient representation and transmission of text-based information in any language. The Basic Multilingual Plane (BMP) is the part of Unicode that covers the most commonly used characters from the world’s languages.

Encoding is both a blessing and a curse in the field of cybersecurity, as threat actors take advantage of encoding to mask malicious activity. On the bright side, encoding is an essential tool used by security teams to protect an organization’s data. This makes it crucial for cybersecurity professionals to understand and work with encoding.

Learn more about Critical Start’s Cyber Research Unit and stay up-to-date with the latest threat research.

