Lesson 1: Simple Substitution

                 CLASSICAL CRYPTOGRAPHY COURSE
                           BY LANAKI
                       September 27, 1995


                           LECTURE 1
                      SIMPLE SUBSTITUTION



INTRODUCTION

Cryptography is the science of writing messages that no one
except the intended receiver can read.   Cryptanalysis is the
science of reading them anyway.  "Crypto" comes from the Greek
'krypte' meaning hidden or vault and "Graphy" comes from the
Greek 'grafik' meaning writing.  The words, characters or
letters of the original intelligible message constitute the
Plain Text (PT).  The words, characters or letters of the
secret form of the message are called Cipher Text (CT) and
together constitute a Cryptogram.

Cryptograms are roughly divided into Ciphers and Codes.

William F. Friedman defines a Cipher message as one produced by
applying a method of cryptography to the individual letters of
the plain text taken either singly or in groups of constant
length.  Practically every cipher message is the result of the
joint application of a General System (or Algorithm) or method
of treatment, which is invariable and a Specific Key which is
variable, at the will of the correspondents and controls the
exact steps followed under the general system.  It is assumed
that the general system is known by the correspondents and the
cryptanalyst.                                       [FRE1]

A Code message is a cryptogram which has been produced by using
a code book consisting of arbitrary combinations of letters,
entire words, figures substituted for words, partial words,
phrases, of PT.   Whereas a cipher system acts upon individual
letters or definite groups taken as units, a code deals with
entire words or phrases or even sentences taken as units.
We will look at both types of systems in this course.

The process of converting PT into CT is Encipherment.  The
reverse process of reducing CT into PT is Decipherment.

Cipher systems are divided into two classes: substitution and
transposition.  A Substitution cipher is a cryptogram in which
the original letters of the plain text, taken either singly or
in groups of constant length, have been replaced by other
letters, figures, signs, or combination of them in accordance
with a definite system and key.  A Transposition cipher is a
cryptogram in which the original letters of the plain text have
merely been rearranged according to a definite system.  Modern
cipher systems use both substitution and transposition to
create secret messages.


SUBSTITUTION AND TRANSPOSITION CIPHERS COMPARED

The fundamental difference between substitution and
transposition methods is that in the former the normal or
conventional values of the letters of the PT are changed,
without any change in the relative positions of the letters in
their original sequences, whereas in the latter only the
relative positions of the letters of the PT in the original
sequences are changed, without any changes to the conventional
values for the letters.  Since the methods of encipherment are
radically different in the two cases, the principles involved
in the cryptanalyses of both types of ciphers are fundamentally
different.   We will look at the methods for determining whether
a cipher has been enciphered by substitution or transposition.
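
The difference can be sketched in a few lines of Python.  This
is only an illustration: the helper names substitute() and
transpose(), the shift of 3 and the sample text are assumptions
made for the example, not systems from this course.

```python
from collections import Counter

def substitute(pt, shift=3):
    """Toy substitution: each letter becomes the one `shift` places later."""
    return "".join(chr((ord(c) - 65 + shift) % 26 + 65) for c in pt)

def transpose(pt):
    """Toy transposition: the same letters written in reverse order."""
    return pt[::-1]

pt = "ATTACKATDAWN"
sub, tra = substitute(pt), transpose(pt)

# Transposition keeps every letter count; substitution keeps only
# the shape of the counts (the same tallies under new letters).
print(sub, Counter(sub) == Counter(pt))
print(tra, Counter(tra) == Counter(pt))
print(sorted(Counter(sub).values()) == sorted(Counter(pt).values()))
```

A frequency count that looks like normal plain text therefore
hints at transposition, while a normal-looking profile attached
to the "wrong" letters hints at substitution.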


SIMPLE SUBSTITUTION

Probably the most popular amateur cipher is the simple
substitution cipher.  We see them in newspapers.  Kids use them
to fool teachers, lovers send them to each other for special
meetings, and they have been used by the Masons, secret Greek
societies and fraternal organizations.  Current gangs in the
Southwest use them to do drug deals.  They are found in
literature like "The Gold Bug" by Edgar Allan Poe, and in death
threats by the infamous Zodiac killer in San Francisco in the
late 1960's.

The Aristocrats (A1-A25) in the Aristocrats Column of "The
Cryptogram"  are all simple substitution ciphers in English.
Each English plain text letter in all its occurrences in the
message is replaced by a unique English ciphertext letter. The
mathematical process is called a one-to-one mapping.  It is
considered unethical (and a possible wedge for the analyst) to
let a plaintext letter stand for itself, that is, to use the
same letter in the ciphertext as in the plaintext.
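
Both conditions are easy to check by machine.  The sketch below
is an illustration only; check_key() is a hypothetical helper,
and the two sample alphabets are the K2 alphabet recovered for
A-1 in the KEYING CONVENTIONS section and the same keyword
sequence written without any shift.

```python
import string

def check_key(ct_alphabet):
    """Return (is_one_to_one, letters_that_stand_for_themselves)."""
    pt_alphabet = string.ascii_uppercase
    # One-to-one: every letter A-Z appears exactly once in the CT row.
    one_to_one = sorted(ct_alphabet) == list(pt_alphabet)
    # Self-enciphered letters are a gift to the analyst.
    fixed = [p for p, c in zip(pt_alphabet, ct_alphabet) if p == c]
    return one_to_one, fixed

good = "QRSUVWXYZLIGHTABCDEFJKMNOP"   # keyword LIGHT set under j
bad = "LIGHTABCDEFJKMNOPQRSUVWXYZ"    # keyword LIGHT set under a

print(check_key(good))
print(check_key(bad))
```

The unshifted alphabet leaves its tail letters unchanged, which
is one reason keyed alphabets are usually slid along the normal
sequence.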

A recurring theme of my lectures is that all substitution
ciphers have a common basis in mathematics and probability
theory.  The basis language of the cipher doesn't matter as
long as it can be characterized mathematically.  Mathematics is
the common link for deciphering any language substitution
cipher.  Based on mathematical principles, we can identify the
language of the cryptogram and then break open its contents.


FOUR BASIC OPERATIONS OF CRYPTANALYSIS

William F. Friedman presents the fundamental operations for the
solution of practically every cryptogram:

(1)  The determination of the language employed in the plain
       text version.

(2)  The determination of the general system of cryptography
       employed.

(3)  The reconstruction of the specific key in the case of a
       cipher system, or the reconstruction, partial or
       complete, of the code book in the case of a code system,
       or both in the case of an enciphered code system.

(4)  The reconstruction or establishment of the plain text.

In some cases, step (2) may precede step (1).   This is the
classical approach to cryptanalysis.  It may be further reduced
to:

    1.  Arrangement and rearrangement of data to disclose
        nonrandom characteristics or manifestations (e.g.
        frequency counts, repetitions, patterns, symmetrical
        phenomena)

    2.  Recognition of the nonrandom characteristics or
        manifestations when disclosed (via statistics or
        other techniques)

    3.  Explanation of nonrandom characteristics when
        recognized.  (by luck, intelligence, or perseverance)

Much of the work is in determining the general system.  In the
final analysis, the solution of every cryptogram involving a
form of substitution depends upon its reduction to mono-
alphabetic terms, if it is not originally in those terms.
                                                       [FRE1]


OUTLINE OF CIPHER SOLUTION

According to the Navy Department OP-20-G Course in Crypt-
analysis, the solution of a substitution cipher generally
progresses through the following stages:

     (a)  Analysis of the cryptogram(s)

            (1) Preparation of a frequency table.
            (2) Search for repetitions.
            (3) Determination of the type of system used.
            (4) Preparation of a work sheet.
            (5) Preparation of individual alphabets (if more
                than one)
            (6) Tabulation of long repetitions and peculiar
                letter distributions.

     (b)  Classification of vowels and consonants by a study
          of:

            (1) Frequencies
            (2) Spacing
            (3) Letter combinations
            (4) Repetitions

     (c)  Identification of letters.

            (1) Breaking in or wedge process
            (2) Verification of assumptions.
            (3) Filling in good values throughout messages
            (4) Recovery of new values to complete the
                solution.

     (d)  Reconstruction of the system.

            (1) Rebuilding the enciphering table.
            (2) Recovery of the key(s) used in the operation
                of the system
            (3) Recovery of the key or keyword(s) used to
                construct the alphabet sequences.

All of the steps above are to be done with orderly reasoning.
It is not an exact mechanical process.                  [OP20]

Since this is a course in Cryptanalysis, let's start cracking
some open.


EYEBALL

While reading the newspaper you see the following cryptogram.
Train your eye to look for wedges or 'ins' into the cryptogram.
Assume that we are dealing with English and that we have simple
substitution.  What do we know?   Although the cryptogram is
short, there are several entries for solution.   Number the
words.  Note that it is a quotation (words 12 and 13, marked
with *, represent a proper name in ACA lingo).


A-1.  Elevated thinker.  K2 (71)                 LANAKI


  1          2               3              4      5
F Y V   Y Z X Y V E F   I T A M G V U X V   Z E   F A


  5          6       7      8            9
I T A M   F Y Q F   M V   Q D V   E J D D A J T U V U


 10       11               12                 13
R O   H O E F V D O.   * Q G R V D F   * E S Y M V Z F P V D


ANALYSIS OF A-1.

Note words 1 and 6 could be: ' The....That' and words 3 and 5
use the same 4 letters  I T A M .    Note that there is a
flow to this cryptogram  The _ _ is? _  _ and? _ _.  Titles
either help or should be ignored as red herrings.  Elevated
might mean "high"  and the thinker could be the proper
person.   We also could attack this cipher using pattern
words (lists of words with repeated letters put into
thesaurus form and referenced by pattern and word length) for
words 2, 3, 6, 9, and 11.



Filling in the cryptogram using [ The... That]  assumption we
have:


  1          2               3              4     5
t h e   h     h e   t             e     e         t
F Y V   Y Z X Y V E F   I T A M G V U X V   Z E   F A


  5          6       7      8            9
          t h a t     e   a   e                   e
I T A M   F Y Q F   M V   Q D V   E J D D A J T U V U


 10       11               12                 13
            t e          a     e   t         h   e   t   e
R O   H O E F V D O.   * Q G R V D F   * E S Y M V Z F P V D


Not bad for a start.  We find the ending  e_t  might be 'est'.
A two letter word starting with t_ is 'to'.  Word 8 is 'are'.
So we add this part of the puzzle.   Note how each wedge leads
to the next wedge.  Always look for confirmation that your
assumptions are correct.  Have an eraser ready to start back
a step if necessary.   Keep a tally of which letters have
been placed correctly.  Mark those that are unconfirmed guesses
with a ?.  Piece by piece, we build on the opening wedge.


  1          2               3              4     5
t h e   h     h e s t       o     e     e     s   t o
F Y V   Y Z X Y V E F   I T A M G V U X V   Z E   F A


  5          6       7      8            9
    o     t h a t     e   a r e   s   r r o       e
I T A M   F Y Q F   M V   Q D V   E J D D A J T U V U


 10       11               12                 13
          s t e r        a     e r t     s   h   e   t   e r
R O   H O E F V D O.   * Q G R V D F   * E S Y M V Z F P V D


Now we have some bigger wedges.  The  s_h is a possible 'sch'
from German.  Word 9 could be 'surrounded.'  Z = i.  The name
could be Albert Schweitzer.  Let's try these guesses.  Word 2
might be 'highest' which goes with the title.


  1          2               3              4     5
t h e   h i g h e s t     n o w l e d g e   i s   t o
F Y V   Y Z X Y V E F   I T A M G V U X V   Z E   F A


  5          6       7      8            9
  n o w   t h a t   w e   a r e   s u r r o u n d e d
I T A M   F Y Q F   M V   Q D V   E J D D A J T U V U


 10       11               12                 13
b y   m y s t e r y      a l b e r t     s c h w e i t z e r
R O   H O E F V D O.   * Q G R V D F   * E S Y M V Z F P V D

The final message is: The highest knowledge is to know that we
are surrounded by mystery.  Albert Schweitzer.
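
The bookkeeping for this wedge-by-wedge process can be sketched
in Python.  The helper apply_partial_key() is a hypothetical
name used for illustration; it prints recovered plain text in
lower case and an underscore for every CT letter not yet placed.

```python
def apply_partial_key(ciphertext, known):
    """Substitute the CT letters placed so far; show '_' for the rest."""
    return "".join(
        known.get(ch, "_") if ch.isalpha() else ch
        for ch in ciphertext
    )

ct = ("FYV YZXYVEF ITAMGVUXV ZE FA ITAM FYQF MV QDV "
      "EJDDAJTUVU RO HOEFVDO")

# Opening wedge: words 1 and 6 read 'the' and 'that'.
known = {"F": "t", "Y": "h", "V": "e", "Q": "a"}
print(apply_partial_key(ct, known))

# Second wedge: the ending 'est', the word 'to' and the word 'are'.
known.update({"E": "s", "A": "o", "D": "r"})
print(apply_partial_key(ct, known))
```

Keeping the guesses in one table makes it easy to back out a
wrong assumption: delete the entry and print again.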

OK, that's the message, but what do we know about the keying
method?


KEYING CONVENTIONS

Ciphertext alphabets are generally mixed for more security,
using an easy mnemonic to remember as a translation key.   ACA
ciphers are keyed in K1, K2, K3, K4 or K()M for mixed variety.
K1 means that a keyword is used in the PT alphabet to scramble
it.  K2 is the most popular for CT alphabet scrambling.  K3
uses the same keyword in both PT and CT alphabets, K4 uses
different keywords in both PT and CT alphabets.   A keyword or
phrase is chosen that can easily be remembered.  Duplicate
letters after the first occurrence are deleted.

Following the keyword, the balance of the letters are written
out in normal order.  A one-to-one correspondence with the
regular alphabet is maintained.  A K2M mixed keyword sequence
using the word METAL and key DEMOCRAT might look like this:


                4  2  5  1  3
                M  E  T  A  L
                =============
                D  E  M  O  C
                R  A  T  B  F
                G  H  I  J  K
                L  N  P  Q  S
                U  V  W  X  Y
                Z

the CT alphabet would be taken off by columns and used:

     CT: OBJQX EAHNV CFKSY DRGLUZ MTIPW
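
The take-off by columns can be sketched as follows.  The
function names keyed_sequence() and transposed_alphabet() are
assumptions made for this illustration; the procedure is the
one shown in the block above.

```python
import string

def keyed_sequence(keyword):
    """Keyword with repeated letters dropped, then the unused letters."""
    seen = []
    for ch in keyword.upper() + string.ascii_uppercase:
        if ch.isalpha() and ch not in seen:
            seen.append(ch)
    return "".join(seen)

def transposed_alphabet(keyword, column_word):
    """Write the keyed sequence in rows, then read columns in key order."""
    seq = keyed_sequence(keyword)
    width = len(column_word)
    # Column i holds every width-th letter starting at position i.
    columns = [seq[i::width] for i in range(width)]
    # Columns are taken off in alphabetical order of the column word.
    order = sorted(range(width), key=lambda i: column_word.upper()[i])
    return "".join(columns[i] for i in order)

print(keyed_sequence("DEMOCRAT"))
print(transposed_alphabet("DEMOCRAT", "METAL"))
```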


Going back to A-1.  Since it is keyed as a K2, we set up the
PT alphabet as a normal sequence and fill in the CT letters
below it.  Do you see the keyword LIGHT?


PT  a b c d e f g h i j k l m n o p q r s t u v w x y z
CT  Q R S U V W X Y Z L I G H T A B C D E F J K M N O P
                     ----------
KW = LIGHT

In tough ciphers, we use the above key recovery procedure to go
back and forth between the cryptogram and keying alphabet to
yield additional information.
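
Once the keyword and its starting point are known, the whole
K2 alphabet can be rebuilt and used to decipher.  A minimal
sketch, assuming the hypothetical helpers k2_alphabet() and
decipher() and the keyword LIGHT set under plaintext j as in
the table above:

```python
import string

def k2_alphabet(keyword, start_under):
    """CT alphabet: keyed sequence slid so the keyword starts under a PT letter."""
    keyed = []
    for ch in keyword.upper() + string.ascii_uppercase:
        if ch not in keyed:
            keyed.append(ch)
    keyed = "".join(keyed)
    offset = string.ascii_uppercase.index(start_under.upper())
    # Rotate right by `offset` places.
    return keyed[-offset:] + keyed[:-offset]

def decipher(ciphertext, ct_alphabet):
    """Map each CT letter back to the PT letter standing above it."""
    table = str.maketrans(ct_alphabet, string.ascii_lowercase)
    return ciphertext.translate(table)

ct_alphabet = k2_alphabet("LIGHT", "j")
print(ct_alphabet)
print(decipher("FYV YZXYVEF ITAMGVUXV ZE FA ITAM FYQF MV QDV "
               "EJDDAJTUVU RO HOEFVDO", ct_alphabet))
```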


To summarize the eyeball method:

1. Common letters appear frequently throughout the message but
   don't expect an exact correspondence in popularity.

2. Look for short, common words (the, and, are, that, is, to)
   and common endings (tion, ing, ers, ded, ted, ess).

3. Make a guess, try out the substitutions, keep track of
   your progress.  Look for readability.


GENERAL NATURE OF ENGLISH LANGUAGE

A working knowledge of the letters, characteristics, relations
with each other, and their favorite positions in words is very
valuable in solving substitution ciphers.

Friedman was the first to employ the principle that English
letters are mathematically distributed in a uniliteral
frequency distribution:


  13 9 8 8 7 7 7 6 6 4 4 3 3 3 3 2 2 2 1 1 1 - - - - -
   E T A O N I R S H L D C U P F M W Y B G V K Q X J Z


That is, in each 100 letters of text, E has a frequency (or
number of appearances) of about 13; T, a frequency of about 9;
K Q X J Z appear so seldom, that their frequency is a low
decimal.

Other important data on English ( based on Hitt's Military
Text):

6 Vowels: A E I O U Y                         =  40 %
20 Consonants:
    5 High Frequency (D N R S T)              =  35 %
   10 Medium Frequency (B C F G H L M P V W)  =  24 %
    5 Low Frequency (J K Q X Z)               =   1 %
                                                ====
                                                100.%


The four vowels A, E, I, O and the four consonants N, R,
S, T form 2/3 of the normal English plain text.   [FR1]

Friedman gives a Digraph chart taken from Parker Hitt's Manual
on p22 of reference.                              [FR2]

The most frequent English digraphs per 2,000 letters are:

TH--50      AT--25       ST--20
ER--40      EN--25       IO--18
ON--39      ES--25       LE--18
AN--38      OF--25       IS--17
RE--36      OR--25       OU--17
HE--33      NT--24       AR--16
IN--31      EA--22       AS--16
ED--30      TI--22       DE--16
ND--30      TO--22       RT--16
HA--26      IT--20       VE--16

The most frequent English trigraphs per 10,000 letters are:

THE--89       TIO--33      EDT--27
AND--54       FOR--33      TIS--25
THA--47       NDE--31      OFT--23
ENT--39       HAS--28      STH--21
ION--36       NCE--27      MEN--20

Frequency of Initial and Final Letters

Letters--  A  B  C  D  E  F  G  H  I  J  K  L  M
Initial--  9  6  6  5  2  4  2  3  3  1  1  2  4
Final  --  1  -  1 10 17  6  4  2  -  -  1  6  1

Letters--  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
Initial--  2 10  2  -  4  5 17  2  -  7  -  3  -
Final  --  9  4  1  -  8  9 11  1  -  1  -  8  -

Relative Frequencies of Vowels.

A 19.5%   E 32.0%   I 16.7%  O 20.2%  U 8.0%  Y 3.6%

Average number of vowels per 20 letters, 8.

Beker and Piper partition the letters of English into 5 groups
based on their Table 1.1                      [STIN], [BP82]

                           Table 1.1
            Probability Of Occurrence of 26 Letters

       Letter     Probability       Letter   Probability
          A          .082             N          .067
          B          .015             O          .075
          C          .028             P          .019
          D          .043             Q          .001
          E          .127             R          .060
          F          .022             S          .063
          G          .020             T          .091
          H          .061             U          .028
          I          .070             V          .010
          J          .002             W          .023
          K          .008             X          .001
          L          .040             Y          .020
          M          .024             Z          .001

Groups
1.  E, having a probability of about 0.127

2.  T, A, O, I, N, S, H, R, each having probabilities between
    0.06 - 0.09

3.  D, L, having probabilities around 0.04

4.  C, U, M, W, F, G, Y, P, B, each having probabilities
    between 0.015 - 0.028.

5.  V, K, J, X, Q, Z, each having probabilities of 0.01 or less.
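
The grouping can be reproduced from Table 1.1 with a short
sketch.  The cut-off values used in group_of() are assumptions
chosen to match the five groups listed; they are not given in
the source.  The last lines also check the earlier statement
that A, E, I, O, N, R, S, T make up about 2/3 of plain text.

```python
# Probabilities copied from Table 1.1.
PROB = {
    "A": .082, "B": .015, "C": .028, "D": .043, "E": .127, "F": .022,
    "G": .020, "H": .061, "I": .070, "J": .002, "K": .008, "L": .040,
    "M": .024, "N": .067, "O": .075, "P": .019, "Q": .001, "R": .060,
    "S": .063, "T": .091, "U": .028, "V": .010, "W": .023, "X": .001,
    "Y": .020, "Z": .001,
}

def group_of(p):
    """Assign a group number using assumed cut-offs between the groups."""
    if p > 0.10:
        return 1
    if p >= 0.06:
        return 2
    if p > 0.03:
        return 3
    if p >= 0.015:
        return 4
    return 5

groups = {n: [] for n in range(1, 6)}
# List letters from most to least probable within each group.
for letter, p in sorted(PROB.items(), key=lambda kv: -kv[1]):
    groups[group_of(p)].append(letter)

for n, letters in groups.items():
    print(n, " ".join(letters))

eight = sum(PROB[c] for c in "AEIONRST")
print(round(eight, 3))
```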


LETTER CHARACTERISTICS AND INTERACTIONS

ELCY gives data for English, German, French, Italian, Spanish,
Portuguese in her Appendices, p218 ff.  She also gives tables of
letter contact data.                                 [ELCY]

LANAKI published data on English and 10 different languages as
well as expanded work on Chinese.  It is available at the CDB.
                                             [NIC1]  [NIC2]

S-TUCK gives detailed English, French and Spanish letter
characteristics in her book.                             [TUCK]

Friedman in his Military Cryptanalytics Part I - Volume 1
gives charts showing the lower and upper limits of deviation
from theoretical (random) for the number of vowels, high, low,
medium frequency consonants, blanks in distributions for
plain text and random text for messages of various lengths.
                                                     [FR1]

Friedman in his Military Cryptanalytics Part I - Volume 2
gives a veritable potpourri of statistical data on letter
frequencies, digraphs, trigraphs, tetragraphs, grouped letters,
relative log data, special purpose data, pattern words,
idiomorphic data, standard endings, initials, foreign language
data [German, French, Italian, Spanish, Portuguese and
Russian], classification of systems used in concealment, nulls
and literals.                                         [FR2]

Sinkov assigns log frequencies to digraphs to aid in
identification.  The procedure is explained by Friedman.
                                                 [FR1]  [SINK]

"ACA and You" presents general properties of English letters.
                                                      [ACA]

Foster presents detailed letter characteristics based on the
Brown Corpus.                                          [CCF]

Don L. Dow puts out a clever computer cryptogram game which
does frequency analysis and is user friendly for very simple
Aristocrats. {Available as shareware}                  [DOW]

Depending on the basis text we choose, we find variations in the
frequency of letters.  For example, literary English gives
slightly different results than frequencies based on military
or ordinary English text.

Hagn presented Literary English Letter Usage Statistics based
on "A Tale of Two Cities" by Charles Dickens as follows:[HAGN]

Total letter count =  586747
Letter use frequencies:     Total doubled letter count = 14421
E:    72881    12.4%        Doubled letter frequencies:
T:    52397     8.9%        LL:     2979    20.6%
A:    47072     8.0%        EE:     2146    14.8%
O:    45116     7.6%        SS:     2128    14.7%
N:    41316     7.0%        OO:     2064    14.3%
I:    39710     6.7%        TT:     1169     8.1%
H:    38334     6.5%        RR:     1068     7.4%
S:    36770     6.2%        PP:      628     4.3%
R:    35946     6.1%        FF:      430     2.9%
D:    27487     4.6%        NN:      301     2.0%
L:    21479     3.6%        CC:      243     1.6%
U:    16218     2.7%        MM:      207     1.4%
M:    14928     2.5%        DD:      201     1.3%
W:    13835     2.3%        GG:       99     0.6%
C:    13223     2.2%        BB:       41     0.2%
F:    13152     2.2%        ZZ:       13     0.0%
G:    12121     2.0%        AA:        2     0.0%
Y:    11849     2.0%        HH:        1     0.0%
P:     9452     1.6%
B:     8163     1.3%
V:     5044     0.8%
K:     4631     0.7%
Q:      655     0.1%
X:      637     0.1%
J:      623     0.1%
Z:      213     0.0%

Total initial letters =  135664  Total ending letters =  135759
Initial letter frequencies:      Ending letter frequencies:

T:    20665    15.2%             E:    26439    19.4%
A:    15564    11.4%             D:    17313    12.7%
H:    11623     8.5%             S:    14737    10.8%
W:     9597     7.0%             T:    13685    10.0%
I:     9468     6.9%             N:    10525     7.7%
S:     9376     6.9%             R:     9491     6.9%
O:     8205     6.0%             Y:     7915     5.8%
M:     6293     4.6%             O:     6226     4.5%
B:     5831     4.2%             F:     5133     3.7%
C:     4962     3.6%             G:     4463     3.2%
F:     4843     3.5%             H:     3579     2.6%

Top digraphs:
TH:   17783    RE:   8139   ED:   6217   IS:   5566
HE:   17226    ND:   7793   AT:   6200   NG:   5564
IN:   10783    HA:   6611   EN:   5849   IT:   5559
ER:   10172    ON:   6464   HI:   5730   OR:   4915
AN:   9974     OU:   6418   TO:   5703   AS:   4836


POSITION AND FREQUENCY TABLE

Time to put to good use the barrage of data presented.  Given
the next slightly harder cryptogram, and ignoring again a
pattern word attack, we can develop some useful tools.  [Much
of what I am covering can be done automatically by computer but
then your brain goes mushy for failure to understand the
process.]


A-2.  [no clue]                                 S-TUCK

V W H A Z S J X I H   S K I M F   M W C G M V   W O J S I F  -

A G F J A Q   Q M N R J K Z M G R S W M F.   J A T W   X H   -

A W F.    F I Q Q W F F X I H   F K H B A O Z   J S M A H H F.

T G A H P K D   X M A W O V F S A R F    X H K I M A F S.
[ Hyphens mean a continuation of a word.]

First we perform a CT Frequency Count.

 F  A  H  M  W  S  I  J  K  X  G  Q  O  R  V  Z  T B C D N P
13 11  9  9  8  7  6  6  5  5  4  4  3  3  3  3  2 1 1 1 1 1

We have 106 letters.  20% are considered low frequency.
20% of 106 = 21.  Counting from right to left we have O, R, V,
Z, T, B, C, D, N, P.  We mark A-2. with a dot over each
appearance.  We also enter the frequency data under the CT.
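
The count and the choice of low frequency letters can be
sketched in Python.  This is an illustration only: the word
list A2_WORDS joins the hyphenated words, and the rule "take
the rarest letters until 20% of the text is covered" is one
reading of the step described above.

```python
from collections import Counter

A2_WORDS = [
    "VWHAZSJXIH", "SKIMF", "MWCGMV", "WOJSIFAGFJAQ",
    "QMNRJKZMGRSWMF", "JATW", "XHAWF", "FIQQWFFXIH",
    "FKHBAOZ", "JSMAHHF", "TGAHPKD", "XMAWOVFSARF", "XHKIMAFS",
]

counts = Counter("".join(A2_WORDS))
total = sum(counts.values())
limit = int(0.20 * total)          # 20% of the message length

# Work up from the rarest letter until the 20% budget is used.
low, running = [], 0
for letter, n in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
    if running + n > limit:
        break
    low.append(letter)
    running += n

print(total, limit)
print(counts.most_common(4))
print("low frequency:", " ".join(low))
```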

Next we develop a CT Letter Position Chart.
                                                    deduced
     F : I    2    3    -      3     2     E        PT equiv's
 A  11 :      /    /    .....  ///   /              i
 B   1 :                .                           v
 C   1 :           /                                w
 D   1 :                                   /        x
 F  13 : /    /         .....        /     /////    s
 G   4 :      /                /                    a
 H   9 :      //   //   .      /     /     //       l
 I   6 :      /         ...          //             u
 J   6 : //        /    ..           /              t
 K   5 :      //   /    .            /              o
 M   9 : /    //   /    ..           //             r
 N   1 :           /                                y
 O   3 :      /                      /              n
 P   1 :                       /                    b
 Q   4 : /         /    .                  /        c
 R   3 :                ..           /              p
 S   7 : /    /         ....               /        h
 T   2 : /                           /              m
 V   3 : /              .                  /        d
 W   8 : /    //        ..     /     /     /        e
 X   5 : ///                   //                   f
 Z   3 :                ..                 /        g
    ===
    106
Columns represent the initial, second and third letters, the
antepenultimate (third from last), penultimate (second from
last) and final letters.  Dots mark any other position in a
word.
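
A position chart of this kind can be tallied by machine.  The
sketch below is an illustration under stated assumptions: the
helper names are hypothetical, and the rule for short words (a
letter is labelled from whichever end of the word is nearer,
and an exact middle letter counts as "other") is a guess at the
convention, so its tallies need not agree with the hand chart
in every cell.

```python
from collections import Counter, defaultdict

START = ["I", "2", "3"]                  # counted from the front
END = ["E", "2nd-last", "3rd-last"]      # counted from the back

def position_label(i, n):
    """Label for 0-based position i in a word of n letters."""
    from_start, from_end = i, n - 1 - i
    if from_start < from_end:
        return START[from_start] if from_start < 3 else "other"
    if from_end < from_start:
        return END[from_end] if from_end < 3 else "other"
    # Exact middle of an odd-length word (assumed convention).
    return "I" if n == 1 else "other"

def position_chart(words):
    """Map each CT letter to a Counter of the positions it occupies."""
    chart = defaultdict(Counter)
    for word in words:
        for i, ch in enumerate(word):
            chart[ch][position_label(i, len(word))] += 1
    return chart

chart = position_chart(["VWHAZSJXIH", "SKIMF", "JATW"])
for letter in sorted(chart):
    print(letter, dict(chart[letter]))
```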

ANALYSIS of A-2. Using Vowel Selection Method.

The Vowel Selection Method is: 1) separate the vowels from the
consonants, 2) assign vowel identities, 3) assign identities to
consonants.


A-2.  [no clue]                                    S-TUCK
         1                2           3            4
.       .                             .     .     .
V W H A Z S J X I H   S K I M F   M W C G M V   W O J S I F  -
3 8 9 + 3 7 6 5 6 9   7 5 6 9 *   9 8 1 4 9 3   8 3 6 7 6 *

                          5                    6       7
                  . .     .     .                .
A G F J A Q   Q M N R J K Z M G R S W M F.   J A T W   X H   -
+ 4 * 6 + 4   4 9 1 3 6 5 3 9 4 3 7 8 9 *    6 + 2 8   5 9

                  8                  9              10
                                      .   . .
A W F.    F I Q Q W F F X I H   F K H B A O Z   J S M A H H F.
+ 8 *     * 6 4 4 8 * * 5 6 9   * 5 9 1 + 3 3   6 7 9 + 9 9 *

    11                 12                     13
.       .   .           . .       .
T G A H P K D   X M A W O V F S A R F    X H K I M A F S.
2 4 + 9 1 5 1   5 9 + 8 3 3 * 7 + 3 *    5 9 5 6 9 + * 7

(two digit figures F=13=* ; A=11=+)

Vowels contact the low frequency letters more often than do
consonants, about 80% of the time.  We use the S-TUCK method
combined with our text. [ELCY]  [TUCK]

We go thru A-2. writing down the contact letters on both sides
of each low frequency CT letter.  We tally one for each contact.
If a CT letter is between two low frequency letters we tally 2.
Contacts for low frequency letters touching each other = 0.  We
do not count N or R in word 5, since they touch each other.  In
word 1, W contacts V, so W is tallied with 1.  A and S contact
Z, so both A and S are credited.   We get:

     /////  ////  //   ///   ///   //   ///   //   //
      W      A     S    G     M    J     K    H    F

                Low Frequency Contacts for A-2.
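
The contact tally can also be sketched in Python, applying the
rules strictly as stated.  The names A2_WORDS, LOW and
low_frequency_contacts() are assumptions made for this
illustration, and hyphenated words are joined.  A strict count
credits A five times (it stands between B and O in word 9),
one more than the hand tally shows; the ranking is unchanged.

```python
from collections import Counter

A2_WORDS = [
    "VWHAZSJXIH", "SKIMF", "MWCGMV", "WOJSIFAGFJAQ",
    "QMNRJKZMGRSWMF", "JATW", "XHAWF", "FIQQWFFXIH",
    "FKHBAOZ", "JSMAHHF", "TGAHPKD", "XMAWOVFSARF", "XHKIMAFS",
]
LOW = set("ORVZTBCDNP")    # the low frequency CT letters of A-2

def low_frequency_contacts(words, low):
    """Credit each non-low letter once per low-frequency neighbor."""
    tally = Counter()
    for word in words:
        for i, ch in enumerate(word):
            if ch not in low:
                continue
            for j in (i - 1, i + 1):
                # Two low letters touching each other score nothing.
                if 0 <= j < len(word) and word[j] not in low:
                    tally[word[j]] += 1
    return tally

contacts = low_frequency_contacts(A2_WORDS, LOW)
for letter, n in contacts.most_common():
    print(letter, "/" * n)
```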

From the Brown Corpus, vowel contact as percentage of total
number of digrams is low:                          [CCF]


                    Second
             A    E    I    O    U    Y

  F     A    0    0   .4    0   .1   .3
  I     E   .7   .4   .2   .1    0   .2
  R     I   .2   .4    0   .7    0    0
  S     O   .1   .1   .1   .3  1.0    0
  T     U   .1   .1   .1    0    0    0
        Y    0   .1    0   .2    0    0

                      Total nonpairs = 5.1%
                               pairs = 0.7%

ELCY tells us quite a bit about vowel behavior.

1.  A, E, I, O, are normally high frequency, U is moderate and
    Y is low frequency.

2.  Letters contacting low frequency letters are usually
    vowels.

3.  Letters showing a wide variety of contact-letters are
    usually vowels.

4.  In repeated digrams, one letter is usually a vowel.

5.  In reversed digrams, one letter is usually a vowel.

6.  Doubled consonants are usually flanked by vowels, and vice
    versa.  ( cvvc or vccv)

7.  It is unusual to find more than 5 consonants in succession.

8.  Vowels do not often contact each other.

9.  If the CT letter with highest frequency is assumed E, any
    other high frequency letter which never touches E, can be
    assumed a vowel.  A letter that contacts it very often can
    not be a vowel.

10. E is most frequent vowel and rarely touches O.  Both double
    freely.

11. The vowel that follows and rarely precedes E is A.

12. The vowel that reverses with E is I.

13. Observations 11 and 12 apply to the vowel O.  However,
    when U is found, it precedes E and follows O.

14. The only vowel-vowel digrams of consequence are OU,EA,IO.

15. Three vowels in sequence may be IOU, EOU, UOU, EAU.

NYPHO's Robot says that the first four or last four letters of
a word contain a vowel.                           [TUCK]

ELCY defines high frequency letter behavior.

About 70% of the language is made up of E, T, A, O, N, I, R, S,
H.  This high frequency group has three cliques.

  Class I.   T, O, S appear frequently both as Initials and
             Finals; terminal O in short words like to.  All
             double freely

  Class II.  A, I, H appear frequently as initials, but rarely
             as finals, especially A, I.  They do not readily
             double.

  Class III. E, N, R, appear frequently as finals, less
             frequently as initials, frequently double,
             especially E, N and R not so often.


When one of these letters changes its class, the least likely
exchange is one occurring between Class II and III.

ELCY gives us tips for identifying consonants:

1.  Those letters still remaining in the high frequency section
    will usually include T, N, R, S, H.  H is the easiest to
    identify, it precedes all vowels, and forms TH, HE, HA.

2.  R is also recognizable, as it reverses openly with all
    vowels, and links with the class I club.

3.  T is usually found by frequency, precedes vowels rather
    than follow them, precedes consonants.  S has a similar
    pattern to a lesser degree.  N confuses this picture.

4.  ST-TS and RT-TR are the only frequent consonant
    reversals.

5.  TT and SS are the most frequent doubles in the language.

Having all this information, we are well armed against even the
most resistant Aristocrat.

We return now to solution of A-2.

From the number of their contacts, W and A are most likely
vowels.  G, K, M are next most likely.

We look at these letters in the position table.

W. has the looks of E even though it is not the most frequent.

A. cannot be A so it might be I.  but frequency may be too
   high.

G. and K. have inside positions and look like vowels but can
   not be identified.

M. might be O by frequency but is confused with R.


A study of A-2. shows that W and A reverse which might be ei
and ie.  AG reverses which might be io or ia.  M repeats, and
reverses with W and G.  It most likely is R not O.  K does not
contact W A G or M.  We mark the cipher with W A G K as vowels
and M as a consonant,  putting in the assumed values.
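
Reversals are tedious to spot by eye and easy to list by
machine.  A minimal sketch, assuming the hypothetical helper
reversed_digrams() and the A-2 words with hyphenated words
joined; only contacts inside a word are considered.

```python
A2_WORDS = [
    "VWHAZSJXIH", "SKIMF", "MWCGMV", "WOJSIFAGFJAQ",
    "QMNRJKZMGRSWMF", "JATW", "XHAWF", "FIQQWFFXIH",
    "FKHBAOZ", "JSMAHHF", "TGAHPKD", "XMAWOVFSARF", "XHKIMAFS",
]

def reversed_digrams(words):
    """Return letter pairs XY for which both XY and YX occur in the text."""
    digrams = {w[i:i + 2] for w in words for i in range(len(w) - 1)}
    return sorted(
        {"".join(sorted(d)) for d in digrams
         if d[0] != d[1] and d[::-1] in digrams}
    )

# Each pair is listed once, in alphabetical order (AG covers AG and GA).
print(reversed_digrams(A2_WORDS))
```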



A-2.  [no clue]                                S-TUCK

         1                2           3            4

d e l i g h t f u l   h o u r s   r e   a r d   e   t h   s
. v c v .     c v c     v v c c   c v . v c .   v .     v c
V W H A Z S J X I H   S K I M F   M W C G M V   W O J S I F  -
3 8 9 + 3 7 6 5 6 9   7 5 6 9 *   9 8 1 4 9 3   8 3 6 7 6 *

                          5                    6       7

i a s t i c   c r     t o g r     h e r s    t i       f l
v v c   v c   c c . .   v . c v .   v c c      v . v   c c
A G F J A Q   Q M N R J K Z M G R S W M F.   J A T W   X H   -
+ 4 * 6 + 4   4 9 1 3 6 5 3 9 4 3 7 8 9 *    6 + 2 8   5 9

                  8                  9              10

i e s     s u c c e s s f u l   s o l   i   g   t h r i l l s
v v c     c v c c v c c c v c   c v c . v . .       c v c c c
A W F.    F I Q Q W F F X I H   F K H B A O Z   J S M A H H F.
+ 8 *     * 6 4 4 8 * * 5 6 9   * 5 9 1 + 3 3   6 7 9 + 9 9 *

    11                 12                     13

  a i l   o     f r i e   d s h i   s    f l   u   i s h
. v v c . v .   c c v v . . c   v . c    c c v v c v c
T G A H P K D   X M A W O V F S A R F    X H K I M A F S.
2 4 + 9 1 5 1   5 9 + 8 3 3 * 7 + 3 *    5 9 5 6 9 + * 7


Using NYPHO's Robot rule, in Word 1, J X I H, one must be a
vowel.  Word 8 shows F X I H contains a vowel.  Word 1
suggests the ending 'ful'.   X = f and H = l.  Examine X I H
and the I is in the vowel position (inner position).  So the
vowels are now W A G K I.   From its end position F = s.  In
words 4 and 11, GA reverses so G cannot be u, for ui is not a
reversal.  We try KI = ou, therefore G = a.  Put these into the
above cipher tableau.   Word 5 breaks the two c's, so Q = c.
Word 1 might be delightful, so V = d, ZSJ = ght.  Remember the
second letter position favors vowels.               [ROBO]


The message reads: Delightful hours reward enthusiastic
cryptographers.  Time flies. Successful solving thrills.
Mailbox friendships flourish.  KW =K1=salutory.


PATTERN WORD ATTACK

Pattern words are words for which one or more letters are
repeated such as awkward, successful, interesting, unusually.
Aegean Park Press publishes pattern word books from 3 - 16
letters.  Pattern word lists are indexed by key letters or
figures or by vowel consonant relationships.  [BARK]  Pattern
words give a quick wedge into the cryptogram.  One of the best
Pattern Word Dictionaries is the Cryptodyct.  [GODD]



The Crypto Drop Box has the TEA computer program which gives
automated pattern searching and anagramming up to 20 words.  It
is a very effective tool.

In A-2, we find a prize in word 8.   Using a key letter
approach:

                A B C C D A A E B F
                F I Q Q W F F X I H
   or
                1 2 3 3 4 1 1 5 2 6   = (334) 11526 [10L]
                F I Q Q W F F X I H

The first pattern found on page 310 Appendix of [CCF] is
successful.   The Cryptodyct uses the latter indexing method
and under 10 letter words we find that the 334 11526 pattern
equals successful.
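
The two indexing schemes above are easy to compute by machine.
The following is a minimal sketch, not the TEA program or any
published index; the function names and the tiny word list are
made up for illustration.

```python
def key_letter_pattern(word):
    """Key-letter pattern: first new letter -> A, next new -> B, ..."""
    seen = {}
    out = []
    for ch in word.upper():
        if ch not in seen:
            seen[ch] = len(seen)
        out.append(chr(ord("A") + seen[ch]))
    return "".join(out)


def figure_pattern(word):
    """Figure pattern: first new letter -> 1, next new -> 2, ..."""
    seen = {}
    out = []
    for ch in word.upper():
        if ch not in seen:
            seen[ch] = len(seen) + 1
        out.append(str(seen[ch]))
    return " ".join(out)


# Hypothetical mini word list standing in for a pattern dictionary.
WORDS = ["successful", "interesting", "juggernaut", "delightful"]

target = key_letter_pattern("FIQQWFFXIH")      # word 8 of A-2
matches = [w for w in WORDS if key_letter_pattern(w) == target]

print(target)                        # key-letter form
print(figure_pattern("FIQQWFFXIH"))  # figure form
print(matches)
```

A real pattern dictionary does the same comparison over many
thousands of words, grouped by length.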

Cryptographers generate their own special lists:

Transposals: from, form; night, thing; mate, meat;
Queer words: adieu, crwth, eggglass, giaour, meaow
Consonant sequences: dths, lcht, ncht, rids, ngst, rths
Favorite ins: people, crypt, success,

Using the TEA model, it was necessary to assume the
vowels at u and e for a 1u22e445u6 template to get
successful and juggernaut on the first try.

Non Pattern word lists are those with words that do not have
even one repeated letter, such as come, wrath, journey.  They
are very useful in attacking Patristocrats and very difficult
Risties.

OMAR gave us this fine list in order of frequency:

   CRYPT   WORDS   ABOUT   KNOWS   BELOW   OKAPI   SWORD
   BLACK   ALONG   AFTER   NEGRO   EXTRA   PLACE   THREW
   WATCH   CRAZY   CAUSE   UNDER   FIRST   SIXTY   WRONG
   WHILE   CROWD   DRUNK   UPSET   FOUND   STUDY
   ANGRY   PLUMB   EMPTY   YIELD

We will come back to it in the Patty section.

Also in the CDB is a program called ASOLVER which automates
the Digram solution method to get the best fit.


MORE ABOUT VOWEL POSITION PREFERENCES

Dr. Raj Wal summarized Barker's Vowel Preferences data.
He also developed cross correlation coefficients for each
letter.  Foster details this work in his book.    [CCF]


This handy little table gives us an entry when needed.  It is
correct more times than it fails.


      Word Length    Position Preferences

         one         1
                     V

         two         1   2
                     V   C

         three       1   2   3
                     C   C   -

         four        1   2   3   4
                     C   V   -   C

         five        1   2   3   4   5
                     C   C   V   C   C

         six         1   2   3   4   5   6
                     C   V   C   -   -   C

         seven       1   2   3   4   5   6   7
                     C   V   C   C   -   -   C

         eight       1   2   3   4   5   .   .   Final
         plus        C   C   -   -   -   -   -     C


Note the vowel preference in the second column.    S-TUCK
describes a method that uses the above table for long word
cryptograms.  She lines the words up under each other and
compares the letter positions with each other.  Using the
columnar method (named by Sherlack) on A-2 we would have
found an incredible four of the vowels!  The same process of
marking the low frequency consonants and word endings would
have given us about half the letters.  Wayne Barker developed a
course based on this method.                       [BAR2]
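
As a rough illustration, the position table can be stored as a
lookup and used to tally which CT letters fall in the positions
it marks V.  This is only a sketch: the table is transcribed
exactly as printed above, and the names PREFERENCES,
preference_template and vowel_position_tally are invented here.
It is not S-TUCK's or Barker's actual procedure.

```python
from collections import Counter

# Position preferences as printed in the table above.
# V = vowel preferred, C = consonant preferred, - = no preference.
PREFERENCES = {
    1: "V",
    2: "VC",
    3: "CC-",
    4: "CV-C",
    5: "CCVCC",
    6: "CVC--C",
    7: "CVCC--C",
}


def preference_template(length):
    """Return the V/C/- template for a word of the given length."""
    if length in PREFERENCES:
        return PREFERENCES[length]
    # "Eight plus" row: C C, then no preference, then a final C.
    return "CC" + "-" * (length - 3) + "C"


def vowel_position_tally(cipher_words):
    """Count how often each CT letter sits in a V-marked position."""
    tally = Counter()
    for word in cipher_words:
        for ch, mark in zip(word, preference_template(len(word))):
            if mark == "V":
                tally[ch] += 1
    return tally


# Two short words from A-2 (time, hours).
print(vowel_position_tally(["JATW", "SKIMF"]))
```

Letters that score repeatedly are vowel candidates to test, no
more than that.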




"DOOSEYS"  = TOUGH ARISTOCRATS

CODEX, MICROPOD and ZYZZ are among the best tough "risties"
constructors.  A tough ristie is a fascinating form of simple
substitution with word division in which the message is of no
importance whatever and the encipherer's full attention has
been given to the manipulation of letter characteristics.
Both ELCY and S-TUCK present versions of George C. Lamb's
Variety of Contact or Consonant Line Approach.  I shall use
ELCY's version and example and expand the consonant line
approach to make it more understandable.   We start with:


A-3.  No clue.  Author Bosley No. 19.  CM.  June 1936.
          1                2                   3

     U W Y M N X K A    E H X R B Z      U V X M U W B Z

          4                  5                 6

O Y Z T W H V C X Y A     C Y A U Z    D B R A H V K B A;

          7                     8            9

Z W S V A H K U Z B K C,     M S C X     C Y X B S,

         10

X V Z Y T R Y C X P.                      (78L)


CONSONANT-LINE METHOD

The object is to isolate a small group of consonants.  Whereas
frequency data can be manipulated, variety of contact data
cannot.   We start with 1) a list of CT contacts in order of
appearance of the letters and 2) rearrange these CT letters in
order of decreasing variety of contacts.

                         A-3. Contacts

 5U6  4W7  7Y9  3M5  1N2  8X10 4K7  6A7  1E1  4H6  3R5  6B8
 ---  ---  ---  ---  ---  ---  ---  ---  ---  ---  ---  ---
 -|W  U|Y  W|M  Y|N  M|X  N|K  X|A  K|-  -|H  E|X  X|B  R|Z
 -|V  U|B  O|Z  X|U   |   H|R  V|B  Y|-   |   W|V  B|A  W|Z
 M|W  T|H  X|A  -|S   |   V|M  H|U  Y U   |   A|V  T|Y  D|R
 A|Z  Z|S  C|A   |    |   C|Y  B|C  R|H   |   A|K   |   K|A
 K|Z   |   C|X   |    |   C|-   |   B|-   |    |    |   Z|K
  |    |   Z|T   |    |   Y|B   |   V|H   |    |    |   X|S
  |    |   R|C   |    |   -|V   |    |    |    |    |    |
                          C|P







 7Z6  5V8  1O1  2T4  6C5  1D1  3S5  1P1
 ---  ---  ---  ---  ---  ---  ---  ---
 B|-  U|X  -|Y  Z|W  V|X  -|B  W|V  X|-
 B|-  H|C   |   Y|R  -|Y   |   M|C   |
 Y|T  H|K   |    |   K|-   |   B|-   |
 U|-  S|A   |    |   S|X   |    |    |
 -|W  X|Z   |    |   -|Y   |    |    |
 U|B   |    |    |   Y|X   |    |    |
 V|Y   |    |    |    |    |    |    |


 Variety of Contact Table (VOC):

 Freq: 8  7  6 5 4 4 6 5 4 7  /  3 3 6 3  /  2 1 1 1 1 1
 VOC:  10 9  8 8 7 7 7 6 6 6  /  5 5 5 5  /  4 2 1 1 1 1
 CT:   X  Y  B V W K A U H Z  /  M R C S  /  T N E O D P
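
The frequency and variety-of-contact figures can be rebuilt
directly from the cryptogram.  The sketch below assumes that
"variety" means the number of different letters a CT letter
touches on either side, with word breaks not counted; that
reading reproduces entries such as 5U6 and 8X10 above.  The
function name is illustrative only.

```python
from collections import defaultdict

# A-3 cipher text, one string per word.
A3 = ("UWYMNXKA EHXRBZ UVXMUWBZ OYZTWHVCXYA CYAUZ DBRAHVKBA "
      "ZWSVAHKUZBKC MSCX CYXBS XVZYTRYCXP").split()


def contact_counts(words):
    """Return {letter: (frequency, variety of contact)}."""
    freq = defaultdict(int)
    neighbours = defaultdict(set)
    for word in words:
        for i, ch in enumerate(word):
            freq[ch] += 1
            if i > 0:
                neighbours[ch].add(word[i - 1])   # left contact
            if i + 1 < len(word):
                neighbours[ch].add(word[i + 1])   # right contact
    return {ch: (freq[ch], len(neighbours[ch])) for ch in freq}


counts = contact_counts(A3)

# List the letters in order of decreasing variety of contact.
for ch in sorted(counts, key=lambda c: counts[c][1], reverse=True):
    f, v = counts[ch]
    print(f"{f}{ch}{v}", end="  ")
print()
```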



We start from the position that the letters making up the
lowest 20% of the total variety count are consonants.  The
variety counts sum to 104, and 20% of 104 = about 21.  Counting
up from the right of the VOC table, the line of demarcation is
between R and C, but 4 letters have the same VOC of 5: M, R, C
and S.  If we take one, we must take all, and one of these most
likely is a vowel.   The key to solution is the VOC "step up"
versus "step down" observation, which compares a letter's
frequency with its variety of contact.  Vowels tend to step up
and consonants tend to step down.  [i.e. 3M5 is a step up of 2
points and 6C5 is a step down of one point.]

M, R, S all step up, C steps down 1 point and most likely is a
consonant.    We develop a separation line and place the
contacts on each side of the consonant line starting from the
right of the VOC table.

                 First Consonant Line
                    C T N E O D P
                 ---------------------
                        V |
                        X | XXXX
                       YY | YYY
                        K |
                        S |
                        Z |
                          | W
                          | R
                        M |
                          | H
                          | B
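
The choice of the letters heading the first consonant line can
be sketched as follows.  The figures are copied from the VOC
table; the way the 20% cut and the tie are handled here is one
reading of the text, not necessarily ELCY's exact procedure,
and all names are illustrative.

```python
# Figures copied from the Variety of Contact table.
CT = "XYBVWKAUHZMRCSTNEODP"
FREQ = [8, 7, 6, 5, 4, 4, 6, 5, 4, 7, 3, 3, 6, 3, 2, 1, 1, 1, 1, 1]
VOC = [10, 9, 8, 8, 7, 7, 7, 6, 6, 6, 5, 5, 5, 5, 4, 2, 1, 1, 1, 1]

freq = dict(zip(CT, FREQ))
voc = dict(zip(CT, VOC))

total = sum(voc.values())
cut = 0.20 * total            # about 21 for A-3

# Walk up from the lowest variety until the running sum passes the cut.
running = 0
boundary = None
for ch in sorted(voc, key=voc.get):
    running += voc[ch]
    if running > cut:
        boundary = voc[ch]    # VOC value where the line falls
        break

# Letters clearly below the boundary are taken as consonants.
sure = [ch for ch in CT if voc[ch] < boundary]

# Letters tied at the boundary are judged by step up / step down.
step = {ch: voc[ch] - freq[ch] for ch in CT}
tied = [ch for ch in CT if voc[ch] == boundary]
consonants = [ch for ch in tied if step[ch] < 0] + sure

print(total, cut)
print({ch: step[ch] for ch in tied})
print(" ".join(consonants))
```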

If any letter does not appear at all below the line, that
letter is most likely a consonant.  A and U fall into this
category.  We add these to the analysis:







          Second Consonant Line
            C T N E O D P A U
          ---------------------
                  VV | V        mark X and Y as Vowels
                   X | XXXX     (vowel)  both step up
                YYYY | YYY      (vowel)  with high VOC
                 KKK |
                   S |
                   Z | ZZ      consonant (step down)
                     | WWW     test as h
                   R | R
                  MM |
                     | HHH
                   B | B
                     | U
                   A |
                     |
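
Both consonant lines can be tallied mechanically.  In this
sketch a letter is counted on the left of the line each time it
stands just before an assumed consonant and on the right each
time it stands just after one; letters that never touch an
assumed consonant are reported separately.  The function names
are invented for illustration.

```python
from collections import defaultdict

A3 = ("UWYMNXKA EHXRBZ UVXMUWBZ OYZTWHVCXYA CYAUZ DBRAHVKBA "
      "ZWSVAHKUZBKC MSCX CYXBS XVZYTRYCXP").split()


def consonant_line(words, consonants):
    """Return {letter: [count left of line, count right of line]}."""
    tally = defaultdict(lambda: [0, 0])
    for word in words:
        for a, b in zip(word, word[1:]):
            if b in consonants:
                tally[a][0] += 1      # a precedes a consonant
            if a in consonants:
                tally[b][1] += 1      # b follows a consonant
    return dict(tally)


def never_touching(words, consonants):
    """Letters outside the set that never contact a member of it."""
    others = set("".join(words)) - set(consonants)
    return others - set(consonant_line(words, consonants))


first = consonant_line(A3, set("CTNEODP"))
second = consonant_line(A3, set("CTNEODPAU"))

print(first["X"], first["Y"])                  # first line
print(sorted(never_touching(A3, "CTNEODP")))   # new consonants
print(second["Y"], second["K"])                # second line
```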

We shift to A-3 and mark in the suspected consonants.


A-3.  No clue.  Author Bosley No. 19.  CM.  June 1936.
cont      1                2                   3

     U W Y M N X K A    E H X R B Z      U V X M U W B Z
     - - o - - o - -    - o o - o -      - - o - - - o -

          4                  5                 6

O Y Z T W H V C X Y A     C Y A U Z    D B R A H V K B A;
- o - - - o - - o o -     - o - - -    - o - - o - - o -

          7                     8            9

Z W S V A H K U Z B K C,     M S C X     C Y X B S,
- - o - - o - - - o - -      - o - o     - o o o o
         10

X V Z Y T R Y C X P.                      (78L)
o - - o - - o - o -


n and h turn up on the right and left side of the consonant
line freely.   W and H are candidates.  Since H is marked as a
vowel above, W might equal h.   Digrams such as sh or ch are
prevalent.  W is in the second position in word 7, which
tentatively confirms the PT h and suggests that Z is a
consonant (step down).  B is a step up, as is S.  The third
word confirms, but the 9th word has four vowels.   Hmm?  K and H
are both possibilities for vowels.  Word 4 tends to favor the
H.   So:








          Final Consonant Line
          C T N E O D P A U W Z
          ---------------------
               VVV | V        mark X and Y as Vowels
                 X | XXXX     (vowel)  both step up
             YYYYY | YYYYY    (vowel)  with high VOC
               KKK |
                 S | S       vowel low freq? =u?
                ZZ | ZZ      consonant (step down)
                   | WWWW    test as h
                 R | R
                MM |
                   | HHHH
               BBB | BBB      vowel
              UUUU | U        consonant
                 A |          consonant
                 T | T        consonant

Let me fill in where ELCY stops.  A-3 has vowels and consonants
separated.  We have the PT letter h.  Word 9 is either clever
or wrong.  Using Barker's Pattern List on p. 39, we find bayou and
miaou.  The same reference gives us thunderclaps for word 7.
Although not correct we find thunderstorm matching the pattern
under 819710/12W and word 8 suggests puma.  The final message
reads: shipyard zealot snapshot kitchenmaid midst goldenrod;
thunderstorm, puma miaou, anticlimax.

The TEA database yields words: thunderstorm and anticlimax.
The reader is invited to reconstruct the keywords, if any.

NON-PATTERN WORD ATTACK

Try this Aristocrat.

A-4.  Fire, fire burning bright.  by Ah Tin Dhu.

   1           2           3            4           5
A B C D E   A C F G H   I C J F H    K C I B L   K F B H L

   6           7           8            9          10
K C M J N   O M J P I   B H L M C    M R S P E   B C A I H

   11          12          13           14          15
T I A U H.   K U M C E   V D U H P.   S C F G D   J W B I L

   16           17          18           19
J S U M L   D U V N P,   V E O M L   C F G L E.

To solve by using non-pattern words, pick 3 or 4 words in the
cipher having several letters in common.  Under one of these
write 5 or 6 words from the non-pattern list.  We will use
OMAR's list given previously.   Note the initials and final
letters and letter positions of the trial words.  In A-4, K is
an initial and L is a terminal.  Choose the non-pattern words to
conform with this requirement.  We write the common letters
under the trial word and try to make a clear message out of the
balance of the CT.  Word 5 has K, BHL and F.

  K F B H L     A C F G H     K C I B L     B H L M C
1 b l a c k         l   c     b     a k     a c k
2 c r a z y         r   z     c     a y     a z y
3 w r o n g         r   n     w     o g     o n g
4 c r o w d         r   w     c     o d     o w d
5 d r u n k         r   n     d     u k     u n k
6 f o u n d         o   n     f     u d     u n d

Line 6 suggests arson, fraud, under.   Putting these into the
ristie, we get:

   1           2           3            4           5
b u r   y   b r o w n   a r s o n    f r a u d   f o u n d
A B C D E   A C F G H   I C J F H    K C I B L   K F B H L

   6           7           8            9          10
f r e         e     a   u n d e r    e       y   u r b a n
K C M J N   O M J P I   B H L M C    M R S P E   B C A I H

   11          12          13           14          15
c a b i n    f i e r y       i n       r o w          u a d
T I A U H.   K U M C E   V D U H P.   S C F G D   J W B I L

   16           17          18           19
    i e d     i           y    e d   r o w d y
J S U M L   D U V N P,   V E O M L   C F G L E.


All the vowels are identified, along with r and n.  The message
is "Burly brown arson fraud found fresh vesta under empty cabin.
Fiery glint.  Prowl squad spied light, gyved rowdy."
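
The trial-word step used on A-4 can be sketched in a few lines.
Each trial word is written under K F B H L and the letters it
fixes are carried into the other cipher words, with unknown
letters shown as dots.  The function names are illustrative;
the trial words are the six used in the table above.

```python
TRIALS = ["black", "crazy", "wrong", "crowd", "drunk", "found"]
TRIAL_CT = "KFBHL"                       # word 5 of A-4
OTHERS = ["ACFGH", "KCIBL", "BHLMC"]     # words 2, 4 and 8


def trial_key(ct_word, pt_word):
    """Map CT letters to PT letters, or None if the fit is impossible."""
    if len(ct_word) != len(pt_word):
        return None
    key, used = {}, {}
    for c, p in zip(ct_word, pt_word):
        # One CT letter must always stand for the same PT letter,
        # and no PT letter may be claimed by two CT letters.
        if key.get(c, p) != p or used.get(p, c) != c:
            return None
        key[c] = p
        used[p] = c
    return key


def apply_key(ct_word, key):
    """Show known letters in lower case and unknown ones as dots."""
    return "".join(key.get(c, ".") for c in ct_word)


for trial in TRIALS:
    key = trial_key(TRIAL_CT, trial)
    if key is not None:
        print(trial, [apply_key(w, key) for w in OTHERS])
```

The solver still has to judge by eye which row yields readable
fragments; here the row for found gives f..ud and und.., which
suggest fraud and under.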


RECAP

1. Common letters appear frequently in a message but not
   necessarily in exact correspondence to the uniform frequency
   distribution.

2. Start working with shorter words, common endings.

3. Look for repetitions of bigrams, trigrams, reversals.

4. Go with the flow  of the cipher text and extract all the
   information on frequency, position and contacts.

5. Eliminate all but a few possibilities. Test and confirm. Test
   and Confirm.

6. Work back and forth from the cryptogram and the keyword
   alphabets.  Expect the message to make some kind of sense.

7. Look for patterns or non patterns.  Separate vowels and
   consonants.  Try brute force.  Use lists.

8. Persevere.


CM REFERENCES

PHOENIX has compiled a list of articles (page 2) concerning
ARISTOCRATS between 1932 - 1993 in "The Cryptogram Index,"
available through the ACA.  On page 27, he lists additional
references on simple substitution.   Articles by B.NATURAL
and S-TUCK are especially useful.                    [INDE]


HOMEWORK PROBLEMS

Solve these cryptograms, recover the keywords, and send your
solutions to me for credit.  Be sure to show how you cracked
them.  If you used a computer program, please provide "gut"
details.  Answers do not need to be typed but should be
generously spaced and not in RED color.  Let me know what part
of the problem was the "ah ha", i.e. the light of inspiration
that brought forth the message to you.

A-1.  Bad design.  K2 (91)                          AURION
V G S   E U L Z K   W U F G Z   G O N   G M   V D G X Z A J U =

X U V B Z     H B U K N D W   V O N   D K   X D K U H H G D F =

N Z X   U K   Y D K   V G U N   A J U X O U B B S

X D K K G B P Z K   D F   N Y Z    B U L Z .

A-2.  Not now.  K1 (92)                        BRASSPOUNDER
K D C Y   L Q Z K T L J Q X   C Y   M D B C Y J Q L :   " T R

H Y D    F K X C ,     F Q   M K X   R L Q Q I Q   H Y D L

M K L   D X C T W   R D C D L Q   J Q M N K X T M B

P T B M Y E Q L   K   F K H   C Y   L Q Z K T L   T C . "


A-3.  Ms. Packman really works!  K4 (101)        APEX DX
* Z D D Y Y D Q T   Q M A R P A C ,   * Q A K C M K

* T D V S V K .   B P   W V G   Q N V O M C M V B :   L D X V

K Q A M S P D   L V Q U ,  L D B Z I   U V K Q F   P O

W A M U X V ,   E M U V P   X Q N V ,  U A M O Z

N Q K L M O V   ( S A P Z V O ) .


A-4.  Money value.  K4 (80)             PETROUSHKA
D V T U W E F S Y Z   C V S H W B D X P   U Y T C Q P V

E V Z F D A   E S T U W X   Q V S P F D B Y   P Q Y V D A F S ,

H Y B P Q   P F Y V C D   Q S F I T X   P X B J D H W Y Z .


A-5.  Zoology lesson.  K4 (78)          MICROPOD
A S P D G U L W ,   J Y C R   S K U Q   N B H Y Q I   X S P I N

O C B Z A Y W N = O G S J Q   O S R Y U W ,   J N Y X U

O B Z A   ( B C W S   D U R B C )   T B G A W   U Q E S L.

* C B S W







REFERENCES

[ACA]  ACA and You, Handbook For Members of the American
       Cryptogram Association, 1995.

[BARK] Barker, Wayne G., "Cryptanalysis of The Simple
       Substitution Cipher with Word Divisions," Aegean Park
       Press, Laguna Hills, CA. 1973.

[BAR1] Barker, Wayne G., "Course No 201, Cryptanalysis of The
       Simple Substitution Cipher with Word Divisions," Aegean
       Park Press, Laguna Hills, CA. 1975.

[B201] Barker, Wayne G., "Cryptanalysis of The Simple
       Substitution Cipher with Word Divisions," Course #201,
       Aegean Park Press, Laguna Hills, CA. 1982.

[BP82] Beker, H., and Piper, F., "Cipher Systems, The
       Protection of Communications", John Wiley and Sons,
       NY, 1982.

[CCF]  Foster, C. C., "Cryptanalysis for Microcomputers",
       Hayden Books, Rochelle Park, NJ, 1990.

[DOW]  Dow, Don. L., "Crypto-Mania, Version 3.0", Box 1111,
       Nashua, NH. 03061-1111, (603) 880-6472, Cost $15 for
       registered version and available as shareware under
       CRYPTM.zip on CIS or zipnet.

[ELCY] Gaines, Helen Fouche, Cryptanalysis, Dover, New York,
       1956.

[GODD] Goddard, Eldridge and Thelma, "Cryptodyct," Marion,
       Iowa, 1976

[FR1]  Friedman, William F. and Callimahos, Lambros D.,
       Military Cryptanalytics Part I - Volume 1, Aegean Park
       Press, Laguna Hills, CA, 1985.

[FR2]  Friedman, William F. and Callimahos, Lambros D.,
       Military Cryptanalytics Part I - Volume 2, Aegean Park
       Press, Laguna Hills, CA, 1985.

[FRE]  Friedman, William F., "Elements of Cryptanalysis,"
       Aegean Park Press, Laguna Hills, CA, 1976.

[HA]   Hahn, Karl, "Frequency of Letters", English Letter
       Usage Statistics using as a sample, "A Tale of Two
       Cities" by Charles Dickens, Usenet SCI.Crypt, 4 Aug
       1994.

[INDE] PHOENIX, Index to the Cryptogram: 1932-1993, ACA, 1994.

[NIC1] Nichols, Randall K., "Xeno Data on 10 Different
       Languages," ACA-L, August 18, 1995.

[NIC2] Nichols, Randall K., "Chinese Cryptography Part 1," ACA-
       L, August 24, 1995.

[OP20] "Course in Cryptanalysis," OP-20-G, Navy Department,
        Office of Chief of Naval Operations, Washington, 1941.

[ROBO] NYPHO, The Cryptogram, Dec 1940, Feb, 1941.

[SINK] Sinkov, Abraham, "Elementary Cryptanalysis", The
       Mathematical Assoc of America, NYU, 1966.

[STIN] Stinson, D. R., "Cryptography, Theory and Practice,"
       CRC Press, London, 1995.

[TUCK] Harris, Frances A., "Solving Simple Substitution
       Ciphers," ACA, 1959.

Notes

Throughout my lectures,  PT will be shown in lower case.  CT
will be shown in upper case.  As a convention, Plain text will
generally be shown above the Cipher text equivalent.

A = Aristocrats, P = Patristocrats, X = Xenocrypts

Any typo errors are my responsibility.  I probably fell asleep
at the keyboard.  Please advise and I will correct them as well
as put out an erratum sheet at the end of the course.  Students
may want to start a 3" permanent binder with separators for the
various lectures and materials.
















                            OUTLINE

    1. Intro - First Principles - Global Mathematical Nature
    2. Keyword Systems and Conventions Used
    3. Simple Substitution Cryptanalysis without/with
       Complexities

         a. Eyeball
         b. Frequency Distributions - General Nature of English
            Letters
         c. Friedman Techniques - Random vs Expected -Spaces
            and a Wealth of Tables: Digram, Trigram, and more
         d. C. C. Foster Techniques
         e. S-Tuck Techniques
         f. Pattern Words
         g. ELCY : Consonant Line Attack
         h. Sinkov Techniques
         i. Barker's Vowel Separation and Position Table
         j. Non Pattern Words: "Dooseys"
         k. SI SI Patterns
         l. CM References for Risties
         m. Relationship to XENOS:French and German Solutions
         n. Computer Program Aids  - TEA Database, CDB, ABACUS,
            Computer Supplement
         o. References


     4.  Homework Problems

     5.  Variant Substitution Systems

           a. Friedman
           b. Waxton


Next lecture we will cover the balance of the outline material
and jump into Patristocrats.