Tech Insights for Educators #2: The nature of digital data

What is digital data? Essentially, it is data that is represented discretely (that is, in steps) rather than continuously. For example, while a mercury thermometer can represent infinitesimally small variations in temperature, a digital thermometer is limited to displaying one of a fixed set of values. While a more expensive, more precise digital thermometer might display several decimal places (e.g., 62.341 degrees), it still cannot be as continuous as the analog equivalent.

When a continuous, analog equivalent is available, why would we want to limit ourselves by representing a phenomenon digitally? There are actually many reasons! Digital data can be more compact, transmittable, faithfully reproducible, duplicable, and losslessly manipulable. For instance, a set of photographic prints takes up a lot of physical space, cannot be transmitted electronically, cannot be reproduced without loss of fidelity, is not easily duplicated, and cannot be manipulated without loss of data. A set of digital photographic files can be stored in as small a space as a microSD memory card (the size of a fingernail), can be easily transmitted over a cable or network (including the Internet), can be reproduced with high fidelity, can be easily duplicated via digital copying, and can be manipulated easily and without data loss (e.g., by editing a copy while the original stays intact).

In common parlance, a digit can be any of 10 values: 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. However, when we talk about digital data, we are almost always talking about binary data. A bit, or binary digit, can only take on two values, represented as (0) zero (“off”) and (1) one (“on”). Bits are the building blocks of all modern computing. Even something as complex as a high-definition motion picture or immersive, interactive video game can be represented, stored, and processed as billions of bits.

Bits are organized into groups of eight, called bytes. Early computers used bytes of various sizes, but the eight-bit byte became the de facto standard during the 1960s and 1970s and was documented in an international standard in 1993; today it is effectively universal. When you see storage capacity listed for a USB flash drive, optical disc, hard or solid-state disk drive, smartphone, et cetera, it is listed in bytes. Because a byte has eight bits, it can take on one of 256 values. That is, the number of potential combinations for a byte is 2^8, which is 2 × 2 × 2 × 2 × 2 × 2 × 2 × 2 = 256, encompassing 00000000, 00000001, 00000010, … all the way to … 11111111.
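To make the counting concrete, here is a minimal Python sketch that lists every pattern a byte can take. The function name `byte_patterns` is made up for this illustration.

```python
BITS_PER_BYTE = 8

def byte_patterns():
    """Return every 8-bit pattern as a string, from 00000000 to 11111111."""
    # format(value, "08b") writes value in binary, padded to 8 digits.
    return [format(value, "08b") for value in range(2 ** BITS_PER_BYTE)]

patterns = byte_patterns()
print(len(patterns))   # 256
print(patterns[:3])    # ['00000000', '00000001', '00000010']
print(patterns[-1])    # 11111111
```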

This means a byte is enough space to store a typical character of text—for instance, the word “byte” can be represented by four bytes—one for each letter. You would need 26 different combinations to store all letters of the alphabet. If you add case sensitivity, you have to double this to 52 to be able to represent both “a” and “A,” “b” and “B,” et cetera. If we add digits 0–9, we now need 62 combinations. With 256 combinations, this leaves plenty of combinations for common symbols and punctuation. While there are characters requiring more than one byte to represent, because 256 combinations isn’t enough when you consider the vast range of symbols, typographical marks, or diacritical marks, eight bits is enough to represent most English text. In more complex text-editing environments (e.g., Microsoft Word), additional bytes are employed to represent other attributes such as font type, font size, and text style (e.g., bold, italics, underline).
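As a rough illustration, the sketch below counts the bytes needed to store a piece of text. It assumes the UTF-8 encoding, which this article does not specifically discuss; in UTF-8, plain English letters take one byte each, while accented and other characters take more. The helper name `byte_count` is invented for the example.

```python
def byte_count(text, encoding="utf-8"):
    """Return how many bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))

print(byte_count("byte"))            # 4 (one byte per letter)
print(list("byte".encode("utf-8")))  # [98, 121, 116, 101]
print(byte_count("\u00e9"))          # 2 (the letter e with an acute accent)
```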

Suppose you have an essay of 5000 words in a simple text-editing environment, with an average word length of five characters. If we are generous and add two characters per word for spaces, line breaks, and punctuation, this gives 5000 × 7 = 35,000 bytes, or 280,000 bits. The 1981 Hayes Smartmodem could transmit 300 bits per second, so in 1981, our essay would take about 280,000 / 300 = 933 seconds to transmit (that is, just under 16 minutes). At the end of the dial-up era, transmission speeds in the United States improved to about 53,000 bits per second, which means our essay could be transmitted in just over five seconds. Modern Internet connections are asymmetric, meaning they download (receive) data faster than they can transmit (“send,” “upload”) data. As of 2013, the average United States Internet user can download 8,700,000 bits per second, and perhaps transmit 1,000,000 bits per second. Therefore, our 5000-word, 280,000-bit essay can now be transmitted in only 0.28 seconds! If we add time for network latency, which is ultimately bounded by the speed of light, we can still typically transmit our essay in under a second. This would be simply impossible if the essay were represented as text on physical paper.
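The arithmetic above can be reproduced with a short Python sketch. The function names are made up for this illustration, and the speeds are the ones quoted in the paragraph; latency is ignored.

```python
def essay_size_bits(words, chars_per_word=7, bits_per_char=8):
    """Size of a plain-text essay in bits, at one byte per character."""
    return words * chars_per_word * bits_per_char

def transmit_seconds(size_bits, bits_per_second):
    """Ideal transmission time, ignoring latency and protocol overhead."""
    return size_bits / bits_per_second

size = essay_size_bits(5000)
print(size)  # 280000

speeds = [
    ("1981 modem (300 bit/s)", 300),
    ("late dial-up (53,000 bit/s)", 53_000),
    ("2013 upload (1,000,000 bit/s)", 1_000_000),
]
for label, speed in speeds:
    print(f"{label}: {transmit_seconds(size, speed):.2f} s")
# 1981 modem (300 bit/s): 933.33 s
# late dial-up (53,000 bit/s): 5.28 s
# 2013 upload (1,000,000 bit/s): 0.28 s
```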

When talking about digital data, because we deal with such large numbers, it is necessary to introduce metric prefixes for ease of discussion and comprehension. That is, we talk about bytes and bits with prefixes that multiply them by factors of a thousand (“kilo”—kilobyte, kilobit), a million (“mega”—megabyte, megabit), a billion (“giga”—gigabyte, gigabit), or a trillion (“tera”—terabyte, terabit). Therefore, a megabit, commonly written as Mb or Mbit, is 1,000,000 bits (125,000 bytes). A megabyte, commonly written as MB, is 1,000,000 bytes (8,000,000 bits). Note that the lowercase “b” indicates a bit, while an uppercase “B” indicates a byte, which is eight bits.

Typically, network transmission speeds are discussed in bits, while storage capacity is discussed in bytes. A common Internet connection speed is asymmetric, with 10 Mb/sec downstream and 1 Mb/sec upstream, meaning that 10 Mb (1.25 MB) of data can be downloaded (received) per second, and 1 Mb (125 KB) of data can be uploaded (transmitted) per second. The Samsung Galaxy S8 smartphone comes with 64 GB of internal nonvolatile storage, meaning that it can store 64 billion bytes (512 billion bits). The latest microSD memory cards can reliably store 256 GB, which is 2.048 trillion bits (2.048 Tb), in an area smaller than a thumbnail!
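Because the bit/byte distinction is easy to trip over, here is a small Python sketch that converts between the units used above. The helper names `to_bits` and `to_bytes` are invented for this example; the prefixes are treated as powers of 1000, as in this article, and both "k" and "K" are accepted for kilo.

```python
PREFIX_FACTORS = {"": 1, "k": 10**3, "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

def to_bits(amount, unit):
    """Convert an amount such as (10, 'Mb') or (64, 'GB') into bits.

    A lowercase 'b' means bits; an uppercase 'B' means bytes (8 bits each).
    """
    prefix, symbol = unit[:-1], unit[-1]
    if symbol not in ("b", "B") or prefix not in PREFIX_FACTORS:
        raise ValueError(f"unknown unit: {unit!r}")
    bits_per_unit = 8 if symbol == "B" else 1
    return amount * PREFIX_FACTORS[prefix] * bits_per_unit

def to_bytes(amount, unit):
    """Convert an amount in the given unit into bytes."""
    return to_bits(amount, unit) / 8

print(to_bytes(10, "Mb"))  # 1250000.0 bytes, i.e. 1.25 MB
print(to_bytes(1, "Mb"))   # 125000.0 bytes, i.e. 125 KB
print(to_bits(64, "GB"))   # 512000000000 (512 billion bits)
print(to_bits(256, "GB"))  # 2048000000000 (2.048 trillion bits)
```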

Digital data can also be compressed. For example, our 35 KB essay has patterns in it which can be stored more succinctly. Doing so requires more computing power to encode and decode, but might reduce the amount of space needed to represent the essay to 10 KB. When dealing with text, this would be a lossless operation, meaning the compression results in no loss of fidelity when reversed (expanded or “decoded”). For example, the HTML, or hypertext markup language that is the foundation of this webpage, is losslessly compressed using “gzip” before being transmitted to, and subsequently decoded by, your web browser.
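A lossless round trip can be tried with Python's built-in `gzip` module. This is only a sketch: the sample text is an artificial, highly repetitive stand-in for an essay, so it shrinks far more than real prose would.

```python
import gzip

# Artificial sample: one sentence repeated 800 times (36,000 bytes).
essay = ("The quick brown fox jumps over the lazy dog. " * 800).encode("ascii")

compressed = gzip.compress(essay)
restored = gzip.decompress(compressed)

print(len(essay))         # 36000
print(len(compressed))    # much smaller; the exact size depends on the gzip settings
print(restored == essay)  # True: nothing was lost
```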

When we represent complex data such as audio, still photographs, and videos digitally, compression is vital, almost universal, and more commonly lossy, meaning that data in unimportant areas is permanently discarded to save storage space. If you remember the old days of audio compact discs (CDs), you may recall that they could only store 74 or 80 minutes of audio because the audio wasn’t compressed. However, through a lossy compression mechanism known as MP3, you could store 10 hours of music on a CD! Similarly, JPEG is the most common method of lossily compressing digital photographs, and H.264 is a leading way to lossily compress digital audiovisual materials. Lossless compression formats exist for audio, images, and video, but their space requirements are tremendous, particularly for video, which is why lossy compression algorithms are used to simplify and discard data in areas likely to be unimportant. For example, in a photograph with dark areas, JPEG encoding discards data in the dark areas because you are unlikely to see it. But, if you were to brighten the image, this data loss would become abundantly apparent! (See the example below; photograph by Richard Thripp.)

JPEG artifacts in shadows pictured left

Most lossy compression algorithms, and even lossless compression algorithms, let you specify the degree of compression. If you want to save more space, you can choose to do so. However, with lossy algorithms, you will lose fidelity, and with lossless algorithms, although no fidelity will be lost, more computational power will be required to compress and decompress the data.
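For a lossless example of choosing the degree of compression, Python's built-in `zlib` module accepts a level from 1 (fastest) to 9 (most thorough). The sketch below is illustrative only; the function name `compressed_sizes` and the sample data are made up, and higher levels usually, though not necessarily always, produce smaller output.

```python
import zlib

def compressed_sizes(data, levels=(1, 6, 9)):
    """Return a dict mapping each compression level to the compressed size in bytes."""
    return {level: len(zlib.compress(data, level)) for level in levels}

# Made-up sample text with a fair amount of repetition.
sample = " ".join(f"word{i % 500} and phrase{i % 37}" for i in range(5000)).encode("ascii")

sizes = compressed_sizes(sample)
print(len(sample))
for level, size in sizes.items():
    print(f"level {level}: {size} bytes")

# Whatever the level, decompression gives back the original exactly.
print(zlib.decompress(zlib.compress(sample, 9)) == sample)  # True
```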

Humans cannot actually listen to audio or view a photograph or video in binary format. When you view a digital image, you are actually seeing an analog representation of that image. Mainly, this means it could look different depending on the device or medium of presentation. For example, a digital image displayed on a computer monitor may look different than when displayed on a smartphone or printed on paper. However, the digital data itself remains the same and can be duplicated without loss of data. In the old days, we would have an analog “master” copy of an audio recording, still image, or video that would be duplicated with loss of fidelity. Then, when that master copy wore out from being frequently duplicated, we might be limited to duplicating a copy of the master copy, and eventually a copy of a copy of a copy, with declining quality each time. For example, security cameras often used to use analog tape that would be recorded and re-recorded ad nauseam, causing the tape to degrade. If the tape was not replaced regularly, shoplifters might appear on the tape as a useless, fuzzy blob. Digital recording largely eliminates this type of problem. (Although repeatedly subjecting digital data to a lossy encoding algorithm produces similar effects, the master copy itself does not degrade by being accessed or duplicated, unless you erase it!)

Digital data, particularly when compressed, is more fragile than analog data. For example, if the signal was bad, analog television transmissions often had noise or “snow,” but could still be watched. However, digital television transmissions stutter or are completely unwatchable if the signal is bad.

Intuitively, it makes sense that uncompressed digital data is more resilient than compressed digital data, meaning that we could lose part of the data and still be able to view the rest of it. For example, if we lost part of our 35 KB essay file, we could still read the rest of it. However, if we compress it to 10 KB, the compression algorithm might require all of those 10 kilobytes to be present to produce readable output. In fact, the more powerful the compression, the more likely that every bit is required to produce any usable output, because of how efficiently and intricately the data is compressed. Moreover, if we lose or forget the algorithm needed to decompress the data, we are lost! Nevertheless, compression is necessary, valuable, and relatively safe if we stick with popular and mainstream formats.
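This fragility can be demonstrated with a short Python sketch using the built-in `zlib` module. The sample text and the helper `try_decompress` are made up for the example, and the behavior shown is for a simple one-shot decompression; how much can be salvaged from damaged data in practice depends on the format and the tools used.

```python
import zlib

text = ("Digital data is stored as bits. " * 200).encode("ascii")
packed = zlib.compress(text, 9)

# Simulate losing the second half of each file.
half_plain = text[: len(text) // 2]
half_packed = packed[: len(packed) // 2]

def try_decompress(blob):
    """Return the decompressed bytes, or None if the data cannot be decoded."""
    try:
        return zlib.decompress(blob)
    except zlib.error:
        return None

# The surviving half of the uncompressed file is still readable as-is.
print(half_plain[:31].decode("ascii"))      # Digital data is stored as bits.

# The complete compressed file decodes, but the truncated one does not.
print(try_decompress(packed) == text)       # True
print(try_decompress(half_packed) is None)  # True
```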

Although a byte has eight bits, it can be more useful to represent it as a number using all 10 digits, or as a “hexadecimal” code. While you might expect a base-10 representation to be numbered 1–256, counting from zero (0) is the prevailing practice, so we would represent the binary byte 00000000 as 0, 00000001 as 1, 00000010 as 2, 10000000 as 128, 11110000 as 240, and 11111111 as 255. Hexadecimal is base-16 rather than base-10, giving us 16 possible values per character instead of 10. While in base-10, 9 is the 10th and final character, hexadecimal continues with A as the 11th character, B as the 12th character, C as the 13th character, D as the 14th character, E as the 15th character, and F as the 16th character. Because 16 × 16 = 256, exactly two hexadecimal characters cover every possible value of a byte. Therefore, 0 (00000000) is 00 and 255 (11111111) is FF in hexadecimal.
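The conversions in this paragraph can be checked with a few lines of Python. The function name `describe_byte` is made up for this illustration.

```python
def describe_byte(bits):
    """Return (decimal value, two-character hexadecimal code) for an 8-bit string."""
    value = int(bits, 2)              # interpret the string as a base-2 number
    return value, format(value, "02X")

for bits in ["00000000", "00000001", "00000010", "10000000", "11110000", "11111111"]:
    value, hex_code = describe_byte(bits)
    print(bits, value, hex_code)
# 00000000 0 00
# 00000001 1 01
# 00000010 2 02
# 10000000 128 80
# 11110000 240 F0
# 11111111 255 FF
```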

It is very common to represent colors in hexadecimal, three-byte R–G–B format. Here, 16,777,216 colors (2^24) can be represented hexadecimally with only six characters, representing 24 bits. R, G, and B stand for red, green, and blue (the three additive primary colors), with higher values indicating brighter colors. In a six-character hexadecimal color code, Characters 1–2 represent red, Characters 3–4 represent green, and Characters 5–6 represent blue. FF is the highest intensity, while 00 is the lowest intensity. Thus, pure red would be FF0000, pure green would be 00FF00, pure blue would be 0000FF, pure white would be FFFFFF, and pure black would be 000000.
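As a minimal sketch, a six-character color code can be split into its red, green, and blue components like this. The function name `parse_hex_color` is made up for the example.

```python
def parse_hex_color(code):
    """Split a six-character hex color code into (red, green, blue), each 0-255."""
    if len(code) != 6:
        raise ValueError("expected exactly six hexadecimal characters")
    # Characters 1-2 are red, 3-4 are green, 5-6 are blue.
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))

print(parse_hex_color("FF0000"))  # (255, 0, 0)      pure red
print(parse_hex_color("00FF00"))  # (0, 255, 0)      pure green
print(parse_hex_color("FFFFFF"))  # (255, 255, 255)  pure white
print(2 ** 24)                    # 16777216 possible colors
```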

Twenty-four bits per pixel is considered a “true color” image. However, if we were to store a photograph from a 15-megapixel (MP) digital camera in true color without compression, we would need three bytes for each of its 15 million pixels, or 45 MB! JPEG compression is essential for reducing this to a more manageable file size of approximately 2–5 MB.
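The same estimate can be written as a short Python sketch. The function name is invented for this illustration, and the 2 MB and 5 MB JPEG sizes are simply the rough range quoted above, not measurements.

```python
def uncompressed_bytes(megapixels, bytes_per_pixel=3):
    """Bytes needed to store an image with no compression at all."""
    return int(megapixels * 1_000_000) * bytes_per_pixel

raw = uncompressed_bytes(15)
print(raw / 1_000_000)  # 45.0 (MB)

# How many times smaller the quoted JPEG sizes are than the raw image.
for jpeg_mb in (2, 5):
    print(raw / (jpeg_mb * 1_000_000))
# 22.5
# 9.0
```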

This was by no means an exhaustive discussion of digital data. It focused primarily on capacity, representation, and compression rather than other concerns such as storage, volatility, latency, transmission, processing, and encryption. Nonetheless, you should now have a grasp of the fundamental underpinnings of the digital world.
