
Section 2.1 Arabic Analysis of the Alphabet

Definition 2.1.1. Cryptanalysis.

Cryptanalysis is the process by which we try to determine the meaning of a message without the aid of a key. We will sometimes describe this as decrypting a message, as opposed to deciphering a message, which is what we do when we know the key.

Mathematics was first employed in the science of cryptology in the analysis of languages. The following is from al-Kindi's treatise on cryptanalysis, written around the year 873 CE. Read through it carefully and then consider the questions that follow it.

"Algorithms of Cryptanalysis

So we say, the enciphered letters are either in numerical proportions, that is poetry -because poetic meter, ipso facto, sets measures to the number of letters in each line-, or they are not. Non-poetry can be cryptanalyzed using either quantitative or qualitative expedients.

The quantitative expedients include determining the most frequently occurring letters in the language in which cryptograms are to be cryptanalyzed. If vowels functioned as the material from which any language is made, and non-vowels functioned as the shape of any language, and since many shapes can be made from the same material, then the number of vowels in any language would be greater than non-vowels. For instance, gold is the material of many shapes of finery and vessels; it may cover crowns, bangles, cups, etc. The gold in these realizations is more than the shapes made of it. Similarly, the vowels which are the material of any kind of text are more than the non-vowels in any language. I mean by vowels the letters: (a), (y or i or e) and (o or u). Therefore the vowels in any language, inevitably, exceed in number the non-vowels in a text of that language. It happens that in certain languages some vowels are greater in number than some other vowels, while non-vowels may be frequent or scarce according to their usage in each language, such as the letter (s), of which frequency of occurrence is high in Latin.

Among the expedients we use in cryptanalyzing a cryptogram if the language is already known, is to acquire a fairly long plaintext in that language, and count the number of each of its letters. We mark the most frequent letter "first", the second most frequent "second", and the following one "third", and so forth until we have covered all its letters. Then we go back to the message we want to cryptanalyze, and classify the different symbols, searching for the most frequent symbol of the cryptogram and we regard it as being the same letter we have marked "first" -in the plaintext-; then we go to the second frequent letter and consider it as being the same letter we have termed "second", and the following one "third", and so on until we exhaust all the symbols used in this cryptogram sought for cryptanalysis.

It could happen sometimes that short cryptograms are encountered, too short to contain all the symbols of the alphabet, and where the order of letter frequency cannot be applied. Indeed the order of letter frequency can normally be applied in long texts, where the scarcity of letters in one part of the text is compensated for by their abundance in another part.

Consequently, if the cryptogram was short, then the correlation between the order of letter frequency in it and in that of the language would no longer be reliable, and thereupon you should use another, qualitative expedient in cryptanalyzing the letters. It is to detect in the language in which cryptograms are enciphered the associable letters and the dissociable ones. When you discern two of them using the letter order of frequency, you see whether they are associable in that language. If so, you seek each of them elsewhere in the cryptogram, comparing it with the preceding and following dissociable letters by educing from the order of frequency of letters, so as to see whether they are combinable or non-combinable. If you find that all these letters are combinable with that letter, you look for letters combinable with the second letter. If found really combinable, so they are the expected letters suggested by the combination and non-combination of letters, and also by their order of frequency. Those expected letters are correlated with words that make sense. The same procedure is repeated elsewhere in the ciphertext until the whole message is cryptanalyzed." [12, vol. 1, pp. 121-123]

Comprehension Check:

  • What do you think the author means when he says “vowels function as the material of a language”?
  • In what way then do the “non-vowels function as the shape”?
  • He also says that there are more vowels than non-vowels. How many vowels are in the sentence you are reading right now? Are there more vowels than non-vowels? If not, how might his statement still be true?
  • How does the gold in his analogy function like the vowels?
  • Finally, how do the author's comments compare to your experiences in Section 1.1?

What al-Kindi is describing above is what we now call frequency analysis, which is the first step in cryptanalysis.

Definition 2.1.2. Frequency Analysis.

Basic frequency analysis is the process of counting the characters in a text in order to determine how many of each character there are relative to the entire length of the text. This is typically the first step in cryptanalysis, the process of breaking an unknown cipher or code.
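As a rough illustration of this definition, the counting can be sketched in a few lines of Python. This is only a minimal sketch, not the n-gram counter used on this page: the function name `letter_frequencies` is made up for the example, and it assumes that only the letters A to Z are counted, with case, spaces, digits, and punctuation ignored.

```python
from collections import Counter

def letter_frequencies(text):
    """Return {letter: (count, relative frequency)} for the letters A-Z in text.

    Assumption for this sketch: case is ignored, and anything that is not
    one of the 26 letters (spaces, digits, punctuation) is skipped.
    """
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    total = len(letters)
    counts = Counter(letters)
    # Relative frequency = count of a letter / total number of letters counted.
    return {ch: (n, n / total) for ch, n in counts.items()}

sample = "Having completed my studies there"
for letter, (count, freq) in sorted(letter_frequencies(sample).items()):
    print(f"{letter}: {count:2d}  {freq:.3f}")
```

The relative frequencies add up to 1, so texts of different lengths can be compared on the same chart.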

Try following al-Kindi's directions in paragraph three above. Use the n-gram counter below to count the number of times each letter appears in the following paragraph and with what frequency, then plot the frequency of each character on the chart below.

“In the year 1878 I took my degree of Doctor of Medicine of the University of London, and proceeded to Netley to go through the course prescribed for surgeons in the army. Having completed my studies there, I was duly attached to the Fifth Northumberland Fusiliers as Assistant Surgeon. The regiment was stationed in India at the time, and before I could join it, the second Afghan war had broken out. On landing at Bombay, I learned that my corps had advanced through the passes, and was already deep in the enemy’s country. I followed, however, with many other officers who were in the same situation as myself, and succeeded in reaching Candahar in safety, where I found my regiment, and at once entered upon my new duties.” - A Study in Scarlet, Sir Arthur Conan Doyle [3]

Figure 2.1.4. Axes for Mapping Letter Frequencies

Below is the same paragraph as in Checkpoint 2.1.3, now enciphered with a shift of three. As above, use the n-gram counter below to count the number of times each letter appears and with what frequency, then plot the frequency of each character on the chart below.

LQWKH BHDUL WRRNP BGHJU HHRIG RFWRU RIPHG LFLQH RIWKH XQLYH
UVLWB RIORQ GRQDQ GSURF HHGHG WRQHW OHBWR JRWKU RXJKW KHFRX
UVHSU HVFUL EHGIR UVXUJ HRQVL QWKHD UPBKD YLQJF RPSOH WHGPB
VWXGL HVWKH UHLZD VGXOB DWWDF KHGWR WKHIL IWKQR UWKXP EHUOD
QGIXV LOLHU VDVDV VLVWD QWVXU JHRQW KHUHJ LPHQW ZDVVW DWLRQ
HGLQL QGLDD WWKHW LPHDQ GEHIR UHLFR XOGMR LQLWW KHVHF RQGDI
JKDQZ DUKDG EURNH QRXWR QODQG LQJDW ERPED BLOHD UQHGW KDWPB
FRUSV KDGDG YDQFH GWKUR XJKWK HSDVV HVDQG ZDVDO UHDGB GHHSL
QWKHH QHPBV FRXQW UBLIR OORZH GKRZH YHUZL WKPDQ BRWKH URIIL
FHUVZ KRZHU HLQWK HVDPH VLWXD WLRQD VPBVH OIDQG VXFFH HGHGL
QUHDF KLQJF DQGDK DULQV DIHWB ZKHUH LIRXQ GPBUH JLPHQ WDQGD
WRQFH HQWHU HGXSR QPBQH ZGXWL HV

A Study in Scarlet, Sir Arthur Conan Doyle [3]
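For readers who would like to see how a ciphertext like the one above could be produced, here is a minimal Python sketch of a shift cipher. The function name `shift_encipher` and its `group` parameter are made up for this example; the sketch assumes that non-letters (spaces, digits, punctuation) are dropped and that the output is written in blocks of five letters, which is how the ciphertext above appears to be laid out.

```python
def shift_encipher(plaintext, shift, group=5):
    """Shift each letter forward by `shift` places, dropping everything else."""
    shifted = [
        chr((ord(ch) - ord("A") + shift) % 26 + ord("A"))
        for ch in plaintext.upper()
        if "A" <= ch <= "Z"
    ]
    # Break the result into blocks of `group` letters, as in the ciphertext above.
    blocks = ["".join(shifted[i:i + group]) for i in range(0, len(shifted), group)]
    return " ".join(blocks)

print(shift_encipher("In the year 1878 I took my degree", 3))
# prints: LQWKH BHDUL WRRNP BGHJU HH
```

The printed blocks agree with the opening of the ciphertext above, which suggests the same conventions were used there.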

Figure 2.1.6. Axes for Mapping Letter Frequencies

How did the plot change? In what ways did the plot not change?

Find a large sample of normal English text (at least 500 characters) and repeat what you did in Checkpoint 2.1.3; that is, use the n-gram counter below to count the number of times each letter appears and with what frequency. Then plot the frequency of each character on the chart below.

Figure 2.1.8. Axes for Mapping Letter Frequencies

Take the English text you used in Checkpoint 2.1.7 and encipher it with a shift cipher. Analyze the ciphertext as you did in Checkpoint 2.1.5. How did the plot change? In what ways did the plot not change?

Figure 2.1.10. Axes for Mapping Letter Frequencies

Now use the n-gram counter to find the letter frequencies for the letters in this ciphertext.

XLMWM EFPSG OSJVI PEXMZ IPCRS VQEPI RKPMW LXIBX LSTIJ YPPCA
LIRCS YEREP CDIXL MWCSY AMPPW IIXLE XIZIR ALIRX LMWMW IRGMT
LIVIH XLIPI XXIVJ VIUYI RGMIW WXECX LIWEQ IEWPS RKEWX LIGMT
LIVMW QSRSE PTLEF IXMGM RTEVX MGYPE VJSVE WLMJX GMTLI VXLIP
IXXIV JVIUY IRGMI WIZIR QEMRX EMRXL IWEQI VIPEX MZITS WMXMS
RXSIE GLSXL IVNYW XWPMH EPSRK PMOIX LIPIX XIVWX LMWQE OIWWY
GLEGM TLIVZ IVCIE WCXSW TSXER HGVEG O

Plot the frequencies you found on the axes in Figure 2.1.4. If you compare the shape of the chart you just made to the chart for normal English that you made previously in Checkpoint 2.1.3, do you notice any similarities? Can you use this to try to decrypt this message?
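One way to turn this comparison into a procedure is sketched below in Python. It is a simplified stand-in for comparing whole charts: it assumes the message was enciphered with a shift cipher and that the most frequent ciphertext letter stands for E, which is a common guess for English but can easily be wrong for short texts. The function names and the sample ciphertext are made up for the example, and the sample is not the exercise above.

```python
from collections import Counter

def guess_shift(ciphertext, assumed_top_letter="E"):
    """Guess the shift, ASSUMING the most common cipher letter is assumed_top_letter."""
    letters = [ch for ch in ciphertext.upper() if "A" <= ch <= "Z"]
    if not letters:
        raise ValueError("ciphertext contains no letters")
    most_common_letter, _ = Counter(letters).most_common(1)[0]
    return (ord(most_common_letter) - ord(assumed_top_letter)) % 26

def shift_decipher(ciphertext, shift):
    """Undo a shift cipher, leaving spaces and punctuation untouched."""
    out = []
    for ch in ciphertext.upper():
        if "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)

# A made-up ciphertext for illustration only.
example = "TLLA TL OLYL DOLU AOL ILSSZ YPUN"
shift = guess_shift(example)
print(shift, shift_decipher(example, shift))
# prints: 7 MEET ME HERE WHEN THE BELLS RING
```

If the result is not readable, the assumption about the most frequent letter was probably wrong, and the next most likely letters can be tried in turn.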

N-Gram Counter:

To use the n-gram counter, copy and paste the text you wish to analyze into the input box and select 1 for N, since we are analyzing single letters.

Figure 2.1.12. N-Gram Analysis Tool

Substitution Cipher Tool:

To use the substitution cipher tool to encipher a message, leave the plain text alone and enter the corresponding ciphertext in the box labeled cipher. For a simple shift cipher, you can put the alphabet into the cipher box in the regular order and then use the shift drop-down menu to select your desired shift.

Figure 2.1.13. Substitution Cipher Tool

Repeat what you did before in Checkpoint 2.1.7 with text from a variety of sources. Be sure to try both long and short pieces of text. Do you agree with al-Kindi's statements about shorter pieces of text? Finally, make a table of your results for future reference.
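If you prefer to build the table with a short program, the Python sketch below shows one possible layout. The function names `frequency_table` and `print_table` are made up for this example, and the two sample strings are taken from the Conan Doyle paragraph earlier in this section purely as placeholders; substitute your own texts. As before, the sketch assumes only the letters A to Z are counted.

```python
from collections import Counter

def frequency_table(text):
    """Return (letter, count, percent) rows sorted from most to least frequent."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    total = len(letters)
    # Sort by descending count, breaking ties alphabetically so the order is stable.
    ranked = sorted(Counter(letters).items(), key=lambda item: (-item[1], item[0]))
    return [(letter, count, 100 * count / total) for letter, count in ranked]

def print_table(label, text, top=5):
    """Print the `top` most frequent letters of text with counts and percentages."""
    print(label)
    rows = frequency_table(text)[:top]
    for rank, (letter, count, percent) in enumerate(rows, start=1):
        print(f"  {rank}. {letter}  {count:3d}  {percent:5.1f}%")

long_text = ("The regiment was stationed in India at the time, and before I could "
             "join it, the second Afghan war had broken out.")
short_text = "On landing at Bombay"

print_table("longer sample", long_text)
print_table("shorter sample", short_text)
```

Comparing the rankings printed for the two samples is one way to test al-Kindi's remark about short cryptograms for yourself.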