J. Imaging, Vol. 8, Pages 328: A Framework for Enabling Unpaired Multi-Modal Learning for Deep Cross-Modal Hashing Retrieval

Figure 1. The various pairwise relationships present in information retrieval datasets. (a) 1-1 Paired, (b) 1-Many Paired, (c) 1-1 Aligned Paired, (d) 1-Many Aligned Paired, and (e) Unpaired.

Figure 2. Overview of an end-to-end deep hashing architecture. This figure illustrates a simplified recreation of the Deep Cross-Modal Hashing (DCMH) [9] network architecture (CNN: Convolutional Neural Network, BOW: bag of words, FC: Fully Connected layers). Example elephant (1), spoon (2) and bicycle (3) images reprinted under Creative Commons attribution: (1) Title: Elephant Addo, Author: Mikefairbanks, Source, CC BY 2.0; (2) Title: Dessert Spoon, Author: Donovan Govan, Source, CC BY-SA 3.0; (3) Title: Electric Bicycle, Author: Mikefairbanks, Source, CC BY-SA 3.0.

Figure 3. Simplified workflow of adversarial-based CMH methods, depicting approaches used by methods such as Deep Adversarial Discrete Hashing (DADH) [12] and Adversary Guided Asymmetric Hashing (AGAH) [10] (CNN: Convolutional Neural Network, BOW: bag of words, FC: Fully Connected layers).
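
To make the two-branch hashing architecture of Figure 2 concrete, the following is a minimal sketch assuming PyTorch, an illustrative stand-in CNN, and placeholder sizes (64-bit codes, a 1386-word BoW vocabulary); it is not the authors' DCMH implementation, and the adversarial components of Figure 3 are omitted.

```python
# Minimal two-branch cross-modal hashing sketch (illustrative, not DCMH itself):
# an image CNN branch and a text (bag-of-words) branch map into a shared
# K-bit hash space; sign() of the tanh outputs gives the binary codes.
import torch
import torch.nn as nn

class ImageHashNet(nn.Module):
    def __init__(self, code_len=64):
        super().__init__()
        # Stand-in CNN; DCMH-style methods use a pretrained backbone instead.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(64, code_len)

    def forward(self, x):
        return torch.tanh(self.fc(self.cnn(x)))  # relaxed codes in (-1, 1)

class TextHashNet(nn.Module):
    def __init__(self, vocab_size=1386, code_len=64):
        super().__init__()
        # Fully connected (FC) layers over the binary BoW vector.
        self.fc = nn.Sequential(
            nn.Linear(vocab_size, 512), nn.ReLU(),
            nn.Linear(512, code_len),
        )

    def forward(self, bow):
        return torch.tanh(self.fc(bow))

if __name__ == "__main__":
    img_net, txt_net = ImageHashNet(), TextHashNet()
    img_codes = torch.sign(img_net(torch.randn(4, 3, 224, 224)))
    txt_codes = torch.sign(txt_net(torch.randint(0, 2, (4, 1386)).float()))
    print(img_codes.shape, txt_codes.shape)  # both torch.Size([4, 64])
```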

Figure 4. Unpaired Multi-Modal Learning (UMML) framework workflow. The diagram shows an example in which 50% of images are unpaired, so the corresponding 50% of text Bag of Words (BoW) binary vectors are emptied. Similarly, in the case of text being unpaired, the corresponding image feature matrices would be emptied (CNN: convolutional neural network).

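The unpairing procedure of Figure 4 can be sketched as below, assuming that "emptying" a modality simply means zeroing its vectors; the array names, shapes, and feature dimensionality are illustrative assumptions rather than the paper's code.

```python
# Sketch of the UMML unpairing step (illustrative): for a chosen fraction of
# samples, one modality is "emptied" (zeroed). mode="image" produces unpaired
# images (their text BoW vectors are emptied); mode="text" produces unpaired
# text (their image feature matrices are emptied).
import numpy as np

def make_unpaired(image_feats, text_bow, unpaired_ratio=0.5, mode="image", seed=0):
    rng = np.random.default_rng(seed)
    n = image_feats.shape[0]
    idx = rng.choice(n, size=int(unpaired_ratio * n), replace=False)
    image_feats, text_bow = image_feats.copy(), text_bow.copy()
    if mode == "image":      # unpaired images: empty the paired text
        text_bow[idx] = 0
    elif mode == "text":     # unpaired text: empty the paired image features
        image_feats[idx] = 0
    return image_feats, text_bow

if __name__ == "__main__":
    imgs = np.random.rand(10000, 4096).astype(np.float32)           # image features
    bows = (np.random.rand(10000, 1386) > 0.99).astype(np.float32)  # binary BoW
    imgs_u, bows_u = make_unpaired(imgs, bows, unpaired_ratio=0.5, mode="image")
    print((bows_u.sum(axis=1) == 0).mean())  # roughly 0.5 of texts are now empty
```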

Figure 5. Results (mAP) on MIR-Flickr25K and NUS-WIDE with unpaired images, i.e., images with no corresponding text. The ‘Paired’ points show results when training with a fully paired training set. Subsequent points show results with increasing amounts of unpaired images in the training set in increments of 20%.

Figure 6. Results (mAP) on MIR-Flickr25K and NUS-WIDE with unpaired text, i.e., text with no corresponding images. The ‘Paired’ points show results when training with a fully paired training set. Subsequent points show results with increasing amounts of unpaired text in the training set in increments of 20%.

Figure 7. Results (mAP) on MIR-Flickr25K and NUS-WIDE with unpaired images and text, i.e., images with no corresponding text and vice versa. The ‘Paired’ points show results when training with a fully paired training set. Subsequent points show results with increasing amounts of unpaired images and text in the training set, for example, ‘10%/10%’ refers to 10% of the training set being unpaired images and another 10% being unpaired text for a total of 20% of the dataset being unpaired samples.

Figure 8. In (a), 20% of the training set was discarded. In (b), 20% of the training set was unpaired. In this example, for both (a,b), the model will be trained on 8000 paired samples. However, (b) will also train with its additional 2000 unpaired samples. This way, the effect of training with or without the additional unpaired samples can be investigated.

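A minimal sketch of the two settings compared in Figure 8, assuming a 10,000-sample training set, a 20% manipulation ratio, and zeroing as the unpairing mechanism (as in Figure 4); names and shapes are illustrative only.

```python
# Illustrative construction of Figure 8's two variants:
# (a) sample discarding -- the selected 2000 samples are removed entirely;
# (b) unpairing -- the same 2000 samples are kept but lose their text modality.
import numpy as np

def discard_vs_unpair(image_feats, text_bow, ratio=0.2, seed=0):
    rng = np.random.default_rng(seed)
    n = image_feats.shape[0]
    idx = rng.choice(n, size=int(ratio * n), replace=False)
    keep = np.setdiff1d(np.arange(n), idx)

    # (a) Sample discarding: train only on the remaining paired samples.
    discarded = (image_feats[keep], text_bow[keep])

    # (b) Unpairing: keep everything, but empty the text of the selected samples.
    bow_unpaired = text_bow.copy()
    bow_unpaired[idx] = 0
    unpaired = (image_feats, bow_unpaired)
    return discarded, unpaired

if __name__ == "__main__":
    imgs = np.random.rand(10000, 4096).astype(np.float32)
    bows = (np.random.rand(10000, 1386) > 0.99).astype(np.float32)
    (a_img, _), (b_img, _) = discard_vs_unpair(imgs, bows, ratio=0.2)
    print(a_img.shape[0], b_img.shape[0])  # 8000 paired vs. 10,000 total samples
```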

Figure 9. Results (mAP) on MIR-Flickr25K and NUS-WIDE with sample discarding, i.e., the training set being reduced. The ‘Full’ points show results when training with the full, unaltered training set. Subsequent points show results with decreasing numbers of samples, where the given percentage denotes the percentage of samples in the training set which have been discarded. The ‘Random’ points hold the baseline random performance values.

Figure 10. Percentage of performance change of DADH computed using formula (4) when training with unpaired samples compared to paired training across 24 classes of MIR-Flickr25K. Red bars show the five classes with the most performance change and green bars show the five classes with the least performance change. The remaining classes are marked as blue bars.
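
Formula (4) is not reproduced on this page; assuming it is the usual relative change between per-class mAP under unpaired and paired training, the quantity plotted in Figure 10 could be computed as in the sketch below. The class names and mAP values are placeholders, not results from the paper.

```python
# Hedged sketch of a per-class percentage performance change, assuming the
# standard relative-change definition; values below are placeholders.
def percentage_change(paired_map, unpaired_map):
    return (unpaired_map - paired_map) / paired_map * 100.0

per_class = {
    # class: (paired mAP, unpaired mAP) -- illustrative numbers only
    "sky":    (0.90, 0.88),
    "clouds": (0.85, 0.80),
    "animal": (0.70, 0.55),
}
for cls, (paired, unpaired) in per_class.items():
    print(f"{cls:>8}: {percentage_change(paired, unpaired):+.1f}%")
```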

Table 1. MIRFlickr-25K and NUS-Wide dataset characteristics.

Dataset         Train    Query   Retrieval
MIRFlickr-25K   10,000   2000    18,015
NUS-Wide        10,000   2100    193,734

Table 2. Example of images, paired tags, and labels from the MIR-Flickr25K and NUS-WIDE datasets. Example images (1) and (2) reprinted under Creative Commons attribution: (1) Author: Martin P. Szymczak, Source, CC BY-NC-ND 2.0; (2) Title: Squirrel, Author: likeaduck, Source, CC BY 2.0.

MIR-Flickr25K example (1)
Tag: bilbao, 11–16, cielo, sky, polarizado, reflejo, reflection, sanidad, estrenandoMiRegalito, geotagged, geo:lat = 43.260867, geo:lon = −2.935705
Label/Class: clouds, sky, structures

NUS-Wide example (2)
Tag: cute, nature, squirrel, funny, boxer, boxing, cuteness, coolest, pugnacious, peopleschoice, naturesfinest, blueribbonwinner, animalkingdomelite, mywinners, abigfave, superaplus aplusphoto, vimalvinayan, natureoutpost
Label/Class: Animal, Nature

Table 3. Results (mAP) on MIR-Flickr25K and NUS-WIDE with unpaired images, i.e., images with no corresponding text. Column ‘Paired’ shows results when training with a fully paired training set. Subsequent columns show results with increasing amounts of unpaired images in the training set.

                MIR-Flickr25K                                    NUS-WIDE
Task  Method   Paired  20%    40%    60%    80%    100%    Paired  20%    40%    60%    80%    100%
i→t   DADH     0.836   0.807  0.789  0.750  0.702  0.562   0.701   0.690  0.683  0.656  0.646  0.297
      AGAH     0.803   0.752  0.729  0.695  0.637  0.535   0.633   0.621  0.583  0.587  0.503  0.267
      JDSH     0.672   0.653  0.648  0.643  0.619  0.555   0.546   0.534  0.510  0.457  0.402  0.253
t→i   DADH     0.823   0.824  0.814  0.812  0.796  0.552   0.707   0.706  0.702  0.670  0.634  0.261
      AGAH     0.790   0.790  0.786  0.779  0.742  0.540   0.646   0.595  0.591  0.596  0.401  0.277
      JDSH     0.660   0.672  0.666  0.652  0.632  0.564   0.566   0.499  0.476  0.452  0.412  0.256
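
For reference, the mAP figures reported in Tables 3–6 are typically obtained by Hamming-ranking evaluation: hash codes of one modality's query set are ranked against the other modality's retrieval set, and a retrieved item counts as relevant if it shares at least one label with the query. The sketch below is a generic version of this computation under that assumption, not necessarily the paper's exact protocol.

```python
# Generic mAP over Hamming ranking for multi-label cross-modal retrieval
# (illustrative evaluation sketch with random codes and labels).
import numpy as np

def mean_average_precision(query_codes, retrieval_codes, query_labels,
                           retrieval_labels, top_k=None):
    aps = []
    for q_code, q_lab in zip(query_codes, query_labels):
        # Hamming distance between {-1, +1} codes: (K - dot product) / 2.
        dist = 0.5 * (q_code.shape[0] - retrieval_codes @ q_code)
        order = np.argsort(dist)
        if top_k is not None:
            order = order[:top_k]
        relevant = (retrieval_labels[order] @ q_lab) > 0  # share >= 1 label
        if relevant.sum() == 0:
            continue
        precision = np.cumsum(relevant) / (np.arange(relevant.size) + 1)
        aps.append((precision * relevant).sum() / relevant.sum())
    return float(np.mean(aps)) if aps else 0.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_codes = np.sign(rng.standard_normal((50, 64)))        # query hash codes
    r_codes = np.sign(rng.standard_normal((500, 64)))       # retrieval hash codes
    q_labels = (rng.random((50, 24)) > 0.8).astype(float)   # multi-label vectors
    r_labels = (rng.random((500, 24)) > 0.8).astype(float)
    print(mean_average_precision(q_codes, r_codes, q_labels, r_labels))
```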

Table 4. Results (mAP) on MIR-Flickr25K and NUS-WIDE with unpaired text, i.e., text with no corresponding images. Column ‘Paired’ shows results when training with a fully paired training set. Subsequent columns show results with increasing amounts of unpaired text in the training set.

                MIR-Flickr25K                                    NUS-WIDE
Task  Method   Paired  20%    40%    60%    80%    100%    Paired  20%    40%    60%    80%    100%
i→t   DADH     0.836   0.831  0.831  0.826  0.820  0.525   0.701   0.700  0.696  0.683  0.674  0.282
      AGAH     0.803   0.755  0.740  0.720  0.682  0.541   0.633   0.597  0.566  0.500  0.356  0.267
      JDSH     0.672   0.646  0.621  0.608  0.580  0.553   0.546   0.515  0.478  0.393  0.342  0.254
t→i   DADH     0.823   0.803  0.783  0.756  0.711  0.545   0.707   0.705  0.724  0.697  0.698  0.274
      AGAH     0.790   0.760  0.744  0.698  0.642  0.535   0.646   0.645  0.653  0.651  0.464  0.267
      JDSH     0.660   0.653  0.622  0.631  0.601  0.545   0.566   0.520  0.506  0.468  0.420  0.249

Table 5. Results (mAP) on MIR-Flickr25K and NUS-WIDE with unpaired images and text, i.e., images with no corresponding text and vice versa. Column ‘Paired’ shows results when training with a fully paired training set. Subsequent columns show results with increasing amounts of unpaired images and text in the training set, for example, ‘10% 10%’ refers to 10% of the training set being unpaired images (UI) and another 10% being unpaired text (UT) for a total of 20% of the dataset being unpaired samples.

                MIR-Flickr25K (UI% / UT%)                        NUS-WIDE (UI% / UT%)
Task  Method   Paired  10/10  20/20  30/30  40/40  50/50    Paired  10/10  20/20  30/30  40/40  50/50
i→t   DADH     0.836   0.820  0.822  0.752  0.728  0.760    0.701   0.696  0.676  0.663  0.676  0.662
      AGAH     0.803   0.741  0.737  0.664  0.673  0.693    0.633   0.642  0.637  0.561  0.564  0.567
      JDSH     0.672   0.652  0.643  0.609  0.610  0.591    0.546   0.547  0.503  0.398  0.306  0.259
t→i   DADH     0.823   0.808  0.801  0.763  0.773  0.762    0.707   0.694  0.716  0.704  0.703  0.698
      AGAH     0.790   0.771  0.762  0.745  0.735  0.729    0.646   0.666  0.642  0.597  0.560  0.565
      JDSH     0.660   0.654  0.650  0.609  0.617  0.594    0.566   0.526  0.498  0.438  0.349  0.255

Table 6. Results (mAP) on MIR-Flickr25K and NUS-WIDE with sample discarding, i.e., the training set being reduced. Column ‘Full’ shows results when training with the full training set without any sample discarding. Subsequent columns show results with decreasing numbers of samples, where the given percentage denotes the percentage of samples in the training set which have been discarded. The ‘Random’ column holds the baseline random performance values.

                MIR-Flickr25K                                     NUS-WIDE
Task  Method   Full    20%    40%    60%    80%    Random    Full    20%    40%    60%    80%    Random
i→t   DADH     0.836   0.824  0.799  0.779  0.744  0.543     0.701   0.683  0.648  0.610  0.575  0.260
      AGAH     0.803   0.763  0.737  0.714  0.678  0.548     0.633   0.633  0.588  0.440  0.366  0.267
      JDSH     0.672   0.657  0.655  0.640  0.634  0.551     0.546   0.543  0.523  0.469  0.457  0.256
t→i   DADH     0.823   0.807  0.797  0.781  0.754  0.537     0.707   0.672  0.663  0.630  0.549  0.258
      AGAH     0.790   0.779  0.778  0.756  0.730  0.538     0.646   0.567  0.547  0.488  0.377  0.267
      JDSH     0.660   0.669  0.654  0.654  0.644  0.559     0.566   0.514  0.517  0.487  0.424  0.245

Table 7. The sampling cases that produced the best retrieval results are indicated by UI: Unpaired Image, UT: Unpaired Text, UIT: Unpaired Image and Text, and SD: Sample discarding. The percentage shown in the brackets is the performance difference by which a given unpaired sample case (shown in Table 3, Table 4 and Table 5) outperformed sample discarding (SD) (shown in Table 6). The first row shows the percentage of training samples being unpaired (UI, UT, UIT), or discarded (SD) depending on the cell value.

MIR-Flickr25K
Task        Method  20%            40%            60%            80%            100%
i→t         DADH    UT (+0.86%)    UT (+3.97%)    UT (+6.02%)    UT (+10.16%)   UIT (+39.93%)
            AGAH    SD             UT (+0.35%)    UT (+0.87%)    UT (+0.58%)    UIT (+26.43%)
            JDSH    SD             SD             SD             SD             UIT (+7.26%)
t→i         DADH    UI (+2.02%)    UI (+2.16%)    UI (+4.04%)    UI (+5.57%)    UIT (+41.93%)
            AGAH    UI (+1.52%)    UI (+0.95%)    UI (+2.98%)    UI (+1.67%)    UIT (+35.5%)
            JDSH    UI (+0.45%)    UI (+1.83%)    SD             SD             UIT (+6.26%)
Both Tasks  DADH    UT (+0.19%)    UIT (+1.74%)   SD             UT (+2.20%)    UIT (+40.93%)
            AGAH    SD             SD             UI (+0.24%)    SD             UIT (+30.92%)
            JDSH    SD             UI (+0.38%)    UI (+0.08%)    SD             UIT (+6.76%)

NUS-WIDE
Task        Method  20%            40%            60%            80%            100%
i→t         DADH    UT (+2.52%)    UT (+7.49%)    UT (+11.98%)   UT (+17.12%)   UIT (+154.54%)
            AGAH    UIT (+1.36%)   UIT (+8.46%)   UI (+33.40%)   UIT (+54.17%)  UIT (+112.43%)
            JDSH    SD             SD             SD             SD             SD
t→i         DADH    UT (+5.09%)    UT (+9.15%)    UIT (+11.65%)  UIT (+28.05%)  UIT (+170.35%)
            AGAH    UIT (+17.58%)  UT (+17.19%)   UT (+26.61%)   UT (+52.36%)   UIT (+111.42%)
            JDSH    UT (+1.17%)    SD             SD             SD             SD
Both Tasks  DADH    UT (+3.70%)    UT (+8.33%)    UT (+11.25%)   UIT (+22.67%)  UIT (+162.41%)
            AGAH    UIT (+9.02%)   UIT (+12.75%)  UI (+27.49%)   UIT (+51.35%)  UIT (+111.92%)
            JDSH    SD             SD             SD
