Warning: This content has not yet been fully revised for this year.
Readings in Perusall:
- Deep Learning with Python chapter 4
Optional Material
The optional Perusall assignment called “Additional Resources” includes some material that you might find helpful.
Equivalence of Softmax+Categorical Cross-Entropy and Sigmoid+Binary Cross-Entropy
A two-output softmax layer with categorical cross-entropy loss…
```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([layers.Dense(2, activation='softmax')])
model.compile(loss='categorical_crossentropy')
```
… is equivalent to a single-output sigmoid layer with binary cross-entropy loss:
```python
model = keras.Sequential([layers.Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy')
```
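Aside from the output layer and the loss name, the main practical difference when training these two setups is the label format: `binary_crossentropy` expects a single 0/1 value per example, while `categorical_crossentropy` expects the same information as a one-hot pair. A quick sketch (the toy labels are made up for illustration):

```python
import numpy as np
from tensorflow import keras

y = np.array([0, 1, 1, 0])                    # 0/1 labels, as binary_crossentropy expects
y_onehot = keras.utils.to_categorical(y, 2)   # [[1,0],[0,1],[0,1],[1,0]], as categorical_crossentropy expects
```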
To see why, notice that the first model's output is `softmax([dot(x, w1) + b1, dot(x, w2) + b2])`, where `w1, b1` and `w2, b2` are the weights and bias of each output unit. Since softmax is unchanged when the same value is added to or subtracted from every input, we might as well subtract `dot(x, w2) + b2` from both entries, which gives `softmax([dot(x, w1 - w2) + (b1 - b2), 0])`. Only one dot product remains, so a single output unit suffices. The rest is just a difference in names:
- `softmax([q, 0])` has another name: `sigmoid(q)`. (Its first entry is `sigmoid(q)` and its second is `1 - sigmoid(q)`.)
- Binary cross-entropy loss `bce(p)` is just a convenient way to write categorical cross-entropy `cce([p, 1-p])`. (This explains the confusing formula you'll see for binary cross-entropy loss, which has `1-y` and `1-p` in it.)
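All three steps are easy to check numerically. Here's a small NumPy sketch; the `softmax`, `sigmoid`, `cce`, and `bce` helpers are defined inline just for this check, not taken from any library:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(q):
    return 1.0 / (1.0 + np.exp(-q))

def cce(probs, onehot):           # categorical cross-entropy for one example
    return -np.sum(onehot * np.log(probs))

def bce(p, y):                    # binary cross-entropy for one example
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

q1, q2 = 1.3, -0.4
q = q1 - q2

# Subtracting the same value from every input leaves softmax unchanged.
print(softmax([q1, q2]), softmax([q, 0.0]))        # same pair of probabilities

# softmax([q, 0]) is just [sigmoid(q), 1 - sigmoid(q)].
print(softmax([q, 0.0]), sigmoid(q))

# cce of [p, 1-p] against the one-hot target [y, 1-y] equals bce(p, y).
p, y = 0.8, 1.0
print(cce(np.array([p, 1 - p]), np.array([y, 1 - y])), bce(p, y))
```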
Aside: regularization and optimizer issues may mean that these models are not exactly equivalent in practice, but the differences will probably be small.
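The functional equivalence itself can be checked directly: build the two-output model, copy its weights into a one-output model as `w1 - w2` and `b1 - b2`, and the predicted probabilities agree; any practical differences come from training, not from the model family. A rough sketch, assuming TensorFlow/Keras (the 4-feature input and variable names are made up for illustration):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Two-output softmax model and one-output sigmoid model on 4 input features.
dense2 = layers.Dense(2, activation='softmax')
dense1 = layers.Dense(1, activation='sigmoid')
softmax_model = keras.Sequential([keras.Input(shape=(4,)), dense2])
sigmoid_model = keras.Sequential([keras.Input(shape=(4,)), dense1])

# Copy the softmax model's weights into the sigmoid model as (w1 - w2) and (b1 - b2).
w, b = dense2.get_weights()                          # w: (4, 2), b: (2,)
dense1.set_weights([w[:, :1] - w[:, 1:], b[:1] - b[1:]])

# The sigmoid output should match the first column of the softmax output.
x = np.random.rand(8, 4).astype('float32')
print(np.allclose(softmax_model.predict(x)[:, 0],
                  sigmoid_model.predict(x)[:, 0],
                  atol=1e-6))                        # expect True
```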