Readme for Results 
obtained with the IIT-NRC developed FRiV program 
http://synapse.vit.iit.nrc.ca/memory/exe/friv-dec04-batch.exe

using
IIT-NRC Facial Video Database



This page supplements the results presented in publications and in the log files.
It is under construction.


Remark 

Since the results often (sometimes significantly) vary even for the same parameter/technique, depending on the other parameters and techniques used, we show only those results which are most representative (but not necessarily the best) for the specific parameter/technique tested.

In some cases, there is a clear indication of which parameters and techniques improve recognition or make it worse. In other cases, improvements may be seen for the majority of video clips, but not for all of them. Here we present only those results which are consistent for the entire database.

The log files should be consulted for a better understanding of the results.

 

Experimental Results

Results on the recognition of faces in the video clips of this database are shown below.
The first clip of each pair is used to memorize a face.
The second clip of each pair is used to test the quality of memorization and to obtain statistics related to the recognition quality.

In the memorization stage, the program
- extracts a face from video, where possible, using video processing techniques, which include
    - face-looking region detection based on the Haar-like feature analysis available in OpenCV,
    - skin detection based on several skin colour models and several non-linearly transformed colour spaces, and
    - motion / foreground analysis based on the motion history and second-order change detection,
- selects the face region known to be of the most significance for face recognition,
- transfers this region to the canonical face model of the nominal face resolution, while performing, where possible, several normalizing preprocessing steps, such as
    - detection of the facial orientation in the image plane, and
    - eye alignment,
- converts the thus obtained nominal-size face image into a binary feature vector Vf using the thresholded version of either one or all of:
    - the average-intensity normalized image,
    - the illumination-invariantly normalized image based on the local view transform, and
    - the gradient images of the face image
    - (other pixel-interrelationship-preserving encoding schemes will be considered in future work, if needed),
- creates the face name tag vector Vt, of size equal to the number of individuals to be memorized, by flipping the one neuron Vid corresponding to the person's ID from -1 to +1, while all other neurons are left unexcited (Vnot-id = -1); this tag vector is appended to the face feature vector Vf to create the binary vector V = (Vf, Vt), which is sent to the memorization module to be memorized as an attractor.
- Finally, the program memorizes the obtained vector by making it an attractor of the attractor-based neural network. This is accomplished by executing a one-take (closed-form solution based) update of the synaptic weights of the network; i.e. each synaptic weight Cij undertakes a small, either positive or negative, increment dCij, which, in general, is a function of the stimulus (i.e. V) and of what has been memorized previously (which is stored in C): dCij = f(V,C). A sketch of such an update is given below.
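For illustration, here is a minimal numpy sketch of such a one-take update, written as the standard incremental (projection) computation of the pseudo-inverse weight matrix; the function names, the tolerance and the desaturation helper are ours, not taken from the FRiV code:

    import numpy as np

    def memorize(C: np.ndarray, V: np.ndarray) -> np.ndarray:
        """One-take pseudo-inverse (projection) update that makes the +/-1
        stimulus vector V an attractor; start from C = np.zeros((N, N))."""
        u = V - C @ V                # the part of V not yet spanned by the memory
        norm = float(V @ u)
        if norm > 1e-6:              # skip if V is already (almost) memorized
            C = C + np.outer(u, u) / norm
        return C

    def desaturate(C: np.ndarray, D: float) -> np.ndarray:
        """Partial self-connection reduction, Cii = D*Cii (see the
        desaturation experiments below)."""
        C = C.copy()
        np.fill_diagonal(C, D * np.diag(C))
        return C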

In the Hebbian synapse update learning rule, which is probably the simplest possible update rule, the weight increment is a function of only the corresponding stimulus components: dCij = f(Vi,Vj). Another very well-known rule, the iteration-based Widrow-Hoff (delta) learning rule, is a better choice for the model, as it takes into account, in some sense, the other parameters as well. Yet the best first-order learning rule (i.e. a rule which takes into account only the pairwise neuron relationships) for the binary fully-connected neural network, in both the theoretical and the practical sense, is the Pseudo-Inverse (PI) learning rule, especially the version which has its self-connections partially reduced.
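For reference, the Hebbian and pseudo-inverse increments can be written side by side in their standard textbook forms (N is the number of neurons):

$$ \Delta C_{ij}^{\mathrm{Hebb}} = \frac{1}{N} V_i V_j , \qquad \Delta C^{\mathrm{PI}} = \frac{(V - CV)(V - CV)^T}{V^T (V - CV)} $$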

It has the highest capacity and the best error correction capability, guaranteed to be of a certain level for a given capacity. It is also guaranteed to converge to an attractor even under the parallel (synchronous) dynamics, which makes the network fast in recognition, as long as the weight matrix is symmetric (Cij = Cji) and a cycle detection mechanism is used to detect two-state cycles, which are the only cycles that can occur in a network trained by the PI rule.

Thus, as long as the stimulus vector is made an attractor of the network, which is easily verified by the condition V*CV > 0, and the network is not saturated, which can be verified by considering the self-connections Cii (on average they should be less than 1/2), the stimulus vector is guaranteed to be retrieved with the best associative recall.
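Both checks are one-liners; here is a sketch continuing the numpy code above (reading the attractor condition component-wise, and the saturation criterion as the average self-connection, which are our interpretations):

    def is_attractor(C: np.ndarray, V: np.ndarray) -> bool:
        """V is a fixed point iff each neuron's post-synaptic potential
        has the same sign as its state: V_i * (CV)_i > 0 for all i."""
        return bool(np.all(V * (C @ V) > 0))

    def is_saturated(C: np.ndarray) -> bool:
        """For a projection-type weight matrix, trace(C) equals the number
        of stored patterns, so an average Cii near 1/2 signals saturation."""
        return float(np.mean(np.diag(C))) >= 0.5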

The entire process, from capturing the image to memorizing a face along with its ID, takes about 100 msec on a Pentium 4 processor, of which memorization takes about half. This allows one to memorize faces from video on the fly, in real time, as they are being observed by the camera.

 

In the recognition stage, the same chain of steps, from video frame capture to binary encoding of the detected face (Yf-query), is performed. The name tag vector, which is appended to the face feature vector, is left unexcited (i.e. all its neurons are at rest: Yi = -1).
As the network evolves, one, several or none of the name tag neurons get(s) excited. This is further analyzed in the context of confidence and repeatability. If several name tag neurons are excited, it means that the system is unsure. At the same time, the result should be sustained over a short period of time, i.e. the same name tag neuron should get excited in at least a few consecutive video frames. Only then is the person's ID announced.
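A sketch of such a repeatability filter (the helper and the three-frame threshold are our illustration; the actual values used by the program should be read from the log files):

    from collections import deque
    from typing import Optional

    def sustained_id(history: "deque[Optional[int]]", n_frames: int = 3) -> Optional[int]:
        """history holds, per frame, the ID of the single fired tag neuron
        (None when zero or several tag neurons fired). The ID is announced
        only if it was sustained over the last n_frames consecutive frames."""
        if len(history) < n_frames:
            return None
        recent = list(history)[-n_frames:]
        if recent[0] is not None and all(r == recent[0] for r in recent):
            return recent[0]
        return None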

Such a recognition process, from video image capture to telling the person's ID, is also very fast. In fact, it is even faster than the memorization process, which involves many multiplications and divisions.

While it may seem that, because of the many iterations and the large number of neurons, it takes a long time to compute all postsynaptic potentials (PSPs) in every iteration of the network's evolution, this is not the case. This is because in every iteration, as proposed in our earlier work, instead of considering all neurons of the network to compute the PSP as
$$ S_j^t = \sum_{i=1}^{N} C_{ij} Y_i^t , $$
we consider only those neurons $k$ which have changed since the last iteration and compute it as
$$ S_j^t = S_j^{t-1} - 2 \sum_{k} C_{kj} Y_k^{t-1} . $$
Since the number of such neurons drops drastically as the network evolves, the number of multiplications becomes very small.
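A sketch of the resulting evolution loop, with the incremental PSP update and the two-state cycle detection mentioned above (a minimal numpy illustration, not the FRiV source):

    def evolve(C: np.ndarray, Y: np.ndarray, max_iter: int = 100) -> np.ndarray:
        """Parallel (synchronous) dynamics Y <- sign(C Y), with the PSP vector
        S updated incrementally from only the neurons that flipped."""
        S = C @ Y                                  # full PSP, first iteration only
        prev = None
        for _ in range(max_iter):
            Y_new = np.where(S >= 0, 1.0, -1.0)
            if np.array_equal(Y_new, Y):           # fixed point: an attractor
                break
            if prev is not None and np.array_equal(Y_new, prev):
                break                              # two-state cycle detected
            flipped = np.flatnonzero(Y_new != Y)
            S = S - 2.0 * (C[flipped].T @ Y[flipped])   # incremental PSP update
            prev, Y = Y, Y_new
        return Y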

Memory size

In the results below, N indicates the neural network size, measured in the number of mutually interconnected binary neurons.
The actual memory size required for the network is N*(N+1)/2*BytesPerWeight. The division by two is due to the fact that the weight matrix is symmetric, which is a theoretically known requirement for the network to converge. Experiments show that representing the weights using one byte (as signed char) is not sufficient, while using four bytes (as float) is quite adequate for good recognition results. (Note: storing the weights as float breaks certain theoretical properties of the PI network, such as CC = C; as a consequence, Si = (CY)i may become larger than 1 in absolute value. This shows no adverse effect on the results, yet may have to be taken care of in future experiments.)

Thus the network of size N=587, which, as shown below, can serve quite well as an associative memory for recognition of up to 10 faces, occupies less than 1 MB on the hard drive, while the network of size N=1739, which shows better associative recall, occupies 7 MB.
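The arithmetic behind these figures, as a quick sketch (assuming four-byte weights; the on-disk sizes quoted above may include additional file overhead):

    def network_storage_bytes(N: int, bytes_per_weight: int = 4) -> int:
        """Only the upper triangle (with the diagonal) of the symmetric
        N x N weight matrix needs to be stored: N*(N+1)/2 weights."""
        return N * (N + 1) // 2 * bytes_per_weight

    print(network_storage_bytes(587) / 2**20)    # ~0.66 MB  (N = 24*24 + 11)
    print(network_storage_bytes(1739) / 2**20)   # ~5.8 MB   (N = 24*24*3 + 11)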

 

Statistics computed

The results presented were obtained, and can be duplicated, using either the batch program friv-bat-nov04.exe or the GUI program gui-nov-jan04.exe, both of which are available for free download and testing from the /exe directory.

For each test, the following five statistics, denoted S10, S11, S01, S00 and S02, are computed.

S10: the neuron corresponding to the correct person's ID fired (+1), while all other neurons remained at rest (-1). This is the best-case performance: no hesitation in saying the person's name from a single video frame.

S11: the neuron corresponding to the correct person's ID fired (+1), but some other neurons (at least one) also fired. This "hesitating" performance can also be considered good, as it can be taken into account when making the final decision based on the average (or majority) of decisions over several video frames. This result can also be used to disregard the frame as "confusing".

S01: the neuron corresponding to the correct person's ID did not fire (-1), while another name tag neuron, corresponding to a different person, fired (+1). This is the worst-case result. It is, however, not always bad either. First, when this happens, there are often other neurons which fire too, indicating an inconsistent decision; this is denoted as the S02 result. Second, unless this result persists over several consecutive frames (which in most cases it does not; see the log file), it can also be identified as an invalid result and thus be discarded.

S00: none of the name tag neurons fired. This result can also be considered a good one, as it indicates that the network does not recognize anybody. This is, in fact, what we want the network to produce when it examines a face which has not been previously seen, or when it examines a part of the video image which has been erroneously classified as a face by the video processing modules, which, as our experiments show, does happen.
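In code, the mapping from the set of excited name tag neurons to these five statistics can be sketched as follows (our illustration of the definitions above):

    def outcome(fired_ids: set, true_id: int) -> str:
        """Classify one frame's result by which name tag neurons fired."""
        if not fired_ids:
            return "S00"                                   # nobody recognized
        if true_id in fired_ids:
            return "S10" if len(fired_ids) == 1 else "S11"
        return "S01" if len(fired_ids) == 1 else "S02"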


The effect of neural network self-connection reduction (also known as desaturation of the network: Cii = D*Cii)

On 10 (11) video clips

10.1 (7): 285 frames in training. 11.2 (1): 2615 frames in testing. 

10 clips (every 7th frame used), the first of the pair, are used for memorization.
11 clips (every frame used), the second of the pair, are used in recognition.
The first of these (ID 0) was not memorized.

nIDs=11,  D=0.10

Small network (using the intensity values only):  

Network: N=587=24x24+11 (00001-100) 

Log file: _100-587-10.1(7)-11.2(1)-d=0.1.log

11 clips, 285 frames in training. 2615 in testing. 
Statistics:   S10   S11   S01   S00  |  S02
ID 0 |   0 | 0 |   71 |  110 |  13   <-- not memorized !!!
ID 1 |  48 | 0 | 1 | 5 |   0
ID 2 | 160 | 7 | 9 | 9 |   1
ID 3 | 226 |   10 |   18 |   56 |   0
ID 4 |  78 | 7 |   86 |   96 |   6
ID 5 |  20 | 2 |   16 |   84 |   3
ID 6 | 140 | 6 |   22 |   53 |   1
ID 7 | 187 |   25 |   17 |   10 |   1
ID 8 | 235 |   60 |   24 |   80 |   3
ID 9 | 122 |   17 |   42 |  101 |  17
ID 10 | 231 |   12 |   23 |   33 |  11
Total:| 1447 |  146 |  329 |  637 |  56
Network: N=587 (00001-100), nIDs=11 || D=0.10, T=1, S0=0.00  

Log file: _100-587-10.1(7)-12.2(1)-d=1.log

12 clips, 285 frames in training. 2813 in testing. 
Statistics:   S10   S11   S01   S00  |  right, but all<S  |  wrong, but many>S
ID 0  |    0 |   1 |  61 |  135 |   0 |   1   <-- not memorized !!!
ID 1  |    0 |   0 |  32 |  154 |   0 |   8   <-- not memorized !!!
ID 2  |   44 |   0 |   0 |   10 |   0 |   0
ID 3  |  166 |   4 |   3 |   12 |   0 |   1
ID 4  |  214 |   1 |   7 |   88 |   0 |   0
ID 5  |   50 |   2 |  21 |  198 |   0 |   2
ID 6  |    7 |   1 |   8 |  109 |   0 |   0
ID 7  |   72 |   1 |   7 |  142 |   0 |   0
ID 8  |  194 |   7 |   3 |   36 |   0 |   0
ID 9  |  264 |  23 |  14 |  100 |   0 |   1
ID 10 |  100 |  12 |  31 |  133 |   0 |  23
ID 11 |  148 |   3 |  17 |  126 |   0 |  16
Total:| 1259 |  55 | 204 | 1243 |   0 |  52
Network: N=587 (00001-100), nIDs=11 || D=1.00, T=1, S0=0.00  
 

Large network  (using the intensities and gradients):

Network: N=1739=24x24x3+11 (00001-111)

10 clips (every 7th frame used), the first of the pair, are used for memorization.
10 clips (every frame used), the second of the pair, are used in recognition.

10 clips, 285 frames in training. 2421 in testing. 
Log file: _111-1739-10.1(7)-0ab.2(1)-d=0.1.log
Statistics:   S10   S11   S01   S00  |  S02 (wrong, but many detected)
ID 0 |  49 | 4 | 0 | 1 |   0
ID 1 | 175 | 0 | 3 | 8 |   0
ID 2 | 288 | 1 | 2 |   19 |   0
ID 3 | 163 | 1 |   11 |   98 |   0
ID 4 |  84 | 2 | 3 |   36 |   0
ID 5 | 202 | 2 | 3 |   15 |   0
ID 6 | 208 | 3 |   12 |   17 |   0
ID 7 | 353 | 3 | 8 |   38 |   0
ID 8 | 191 | 8 |   30 |   62 |   8
ID 9 | 259 | 0 |   10 |   24 |  17
Total:| 1972 |   24 |   82 |  318 |  25
Network: N=1739 (00001-111), nIDs=11 || D=0.10, T=1, S0=0.00  
Log file: _111-1739-10.1(7)-10.2(1)-d=0.1.log
ID 0 |   0 | 1 |   70 |  112 |  15  <-- not memorized !!!
ID 1 |   0 | 0 |   73 |  112 |   9  <-- not memorized !!!
 
10.1 (7): 285 frames in training. 10.2 (2): 1210  frames in testing. 
10 clips (every 7th frame used), the first of the pair, are used for memorization.
10 clips (every 2nd frame used), the second of the pair, are used in recognition.
 
D=0.10
D=0.15   D=1.00   Illumination invariant feature binarization (00001-211)

Statistics:   S10   S11   S01   S00  |  S02

ID 1  |   25 |  3 |  0 |   0 |   0
ID 2  |   88 |  0 |  2 |   5 |   0
ID 3  |  142 |  1 |  1 |  10 |   0
ID 4  |   79 |  0 |  6 |  48 |   0
ID 5  |   41 |  2 |  1 |  18 |   0
ID 6  |  101 |  2 |  1 |   8 |   0
ID 7  |  107 |  1 |  4 |   8 |   0
ID 8  |  177 |  1 |  4 |  19 |   0
ID 9  |   95 |  5 | 18 |  30 |   4
ID 10 |  129 |  0 |  4 |  12 |   8
Total:|  984 | 15 | 41 | 158 |  12

D=0.05
ID 1  |   27 |  0 |  0 |   1 |   0
ID 2  |   87 |  0 |  0 |   8 |   0
ID 3  |  141 |  1 |  1 |  11 |   0
ID 4  |   76 |  0 |  8 |  49 |   0
ID 5  |   40 |  1 |  2 |  19 |   0
ID 6  |  102 |  2 |  1 |   7 |   0
ID 7  |  106 |  3 |  3 |   8 |   0
ID 8  |  178 |  1 |  4 |  17 |   1
ID 9  |   93 |  4 | 15 |  35 |   5
ID 10 |  127 |  0 |  9 |  10 |   7
Total:|  977 | 12 | 43 | 165 |  13

D=0.01
Total:|  983 |  8 | 46 | 160 |  13

ID 1  |   24 |  3 |  0 |   1 |   0
ID 2  |   87 |  0 |  1 |   7 |   0
ID 3  |  140 |  0 |  1 |  13 |   0
ID 4  |   79 |  1 |  4 |  49 |   0
ID 5  |   41 |  2 |  2 |  17 |   0
ID 6  |  105 |  0 |  1 |   6 |   0
ID 7  |  105 |  2 |  6 |   7 |   0
ID 8  |  180 |  1 |  5 |  15 |   0
ID 9  |   90 |  4 | 13 |  38 |   7
ID 10 |  126 |  0 |  5 |  13 |   9
Total:|  977 | 13 | 38 | 166 |  16

ID 1  |   22 |  1 |  0 |   5 |   0
ID 2  |   90 |  0 |  1 |   4 |   0
ID 3  |  144 |  0 |  1 |   9 |   0
ID 4  |   90 |  1 |  1 |  41 |   0
ID 5  |   37 |  3 |  1 |  21 |   0
ID 6  |  104 |  0 |  1 |   7 |   0
ID 7  |  108 |  2 |  2 |   8 |   0
ID 8  |  176 |  3 |  2 |  20 |   0
ID 9  |   84 |  6 | 11 |  45 |   6
ID 10 |  122 |  1 |  4 |  17 |   9
Total:|  977 | 17 | 24 | 177 |  15

ID 1  |   22 |  0 |  0 |   6 |   0
ID 2  |   81 |  0 |  1 |  13 |   0
ID 3  |  145 |  1 |  0 |   8 |   0
ID 4  |   75 |  1 |  4 |  53 |   0
ID 5  |   35 |  0 |  4 |  23 |   0
ID 6  |  105 |  0 |  3 |   4 |   0
ID 7  |   67 |  0 |  9 |  44 |   0
ID 8  |  172 |  3 |  5 |  21 |   0
ID 9  |   87 |  7 | 12 |  46 |   0
ID 10 |  116 |  1 |  8 |  21 |   7
Total:|  905 | 13 | 46 | 239 |   7

Total:|  901 | 13 | 49 | 239 |   8
D=0.10

--------------------- (the part below is not ready yet)---------------------

The effect of cropping and encoding schemes (to be added soon)

On a small network (intensities only) with 8 video clips: N=24x24+11=587 (00001-100)
Network: N=587 (00001-100), nIDs=11 
8.1 (7): 205 frames in training. 8.2 (2): 905 frames in testing. 

Statistics:   S10   S11   S01/S02   S00
 
D=1.00
ID 0 |  26 |  0 |  0/0 |   2
ID 1 |  90 |  2 |  2/0 |   1
ID 2 | 125 |  0 |  2/0 |  27
ID 3 |  43 |  2 | 29/1 |  58
ID 4 |  12 |  1 |  7/0 |  42
ID 5 |  63 |  2 |  6/0 |  41
ID 6 | 108 |  2 |  1/1 |   8
ID 7 | 149 | 12 | 16/1 |  23
Total:| 616 | 21 | 63/3 | 202


D=0.15 
634 
D=0.10
Total:| 658 | 37 | 97/4 | 109
 
 
 
 

N=587 (11***001-100), nIDs=11 || D=0.15, T=1, S0=0.00 
539

N=587  (11****001-200), nIDs=11 || D=0.15, T=1, S0=0.00 
612
N=587 (00001-2***00), nIDs=11 || D=0.10, T=1, S0=0.00 
615 | 42 | 54 | 189 

N= 1163(11***001-11**0), nIDs=11 || D=0.15, T=1, S0=0.00 
783
 
 
 
to be finished and organized

Created: 10.XII.2004. Last Updated: 14.IV.2005
Computational Video Group, IIT-ITI, NRC-CNRC
Project Leader: Dmitry O. Gorodnichy.
Email for sending comments: memory@perceptual-vision.com
www.perceptual-vision.com (synapse.vit.iit.nrc.ca). 
Copyright 2004-2005