Action recognition based on discrete cosine transform by optical pixel-wise encoding

The temporal DCT (T-DCT) spectrum is computed pixel-wise along the time axis as

$$X_{i,j,k}=\sum_{n=0}^{N-1} x_{i,j,n}\cos\!\left[\frac{\pi}{N}\left(n+\frac{1}{2}\right)k\right],$$

and $f=Z_{\mathrm{avg}}^{c}$ denotes the channel descriptor,

$$Z_{\mathrm{avg}}^{c}=\frac{1}{H\times W}\sum_{j=1}^{H}\sum_{k=1}^{W}x_{c,j,k}.\tag{5}$$

The final output of the selector is computed as

$$\tilde{X}_{c}=M_{c}\odot X_{c},\tag{6}$$

where $\odot$ denotes the element-wise product and $M_{c}$ is the selection mask of frequency channel $c$. The joint loss function is

$$\mathcal{L}=\mathcal{L}_{\mathrm{Acc}}+\lambda\cdot\sum_{c=1}^{C}M_{c},\tag{7}$$

where $\mathcal{L}_{\mathrm{Acc}}$ is the loss for the action recognition model, and $\lambda$ is a hyperparameter indicating the relative importance of the frequency channel selector.
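To make the selector and the joint objective concrete, the following is a minimal sketch in PyTorch of the temporal DCT, the channel descriptor of Eq. (5), a gated selector, and the joint loss of Eq. (7). It is not the authors' implementation: the tensor shapes, the sigmoid gating of the learnable channel scores, and the helper names (`temporal_dct`, `FrequencyChannelSelector`, `joint_loss`) are our assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def temporal_dct(video: torch.Tensor) -> torch.Tensor:
    """Temporal DCT of a pixel-wise sequence: video (N, H, W) -> spectrum (N, H, W)."""
    N = video.shape[0]
    n = torch.arange(N, dtype=video.dtype)
    k = torch.arange(N, dtype=video.dtype)
    # basis[k, t] = cos(pi/N * (t + 1/2) * k), as in the T-DCT expression above
    basis = torch.cos(math.pi / N * (n[None, :] + 0.5) * k[:, None])
    return torch.einsum("kn,nhw->khw", basis, video)


def channel_descriptor(x: torch.Tensor) -> torch.Tensor:
    """Z_avg^c of Eq. (5): global average over the H x W spatial dimensions."""
    return x.mean(dim=(-2, -1))


class FrequencyChannelSelector(nn.Module):
    """Gates each frequency channel with a mask M_c derived from a learnable score."""

    def __init__(self, num_channels: int):
        super().__init__()
        # channel scores initialised to 1, as stated in Sec. III
        self.scores = nn.Parameter(torch.ones(num_channels))

    def forward(self, x: torch.Tensor):
        # x: (B, C, H, W); M_c in (0, 1) is shared across the batch (assumption)
        mask = torch.sigmoid(self.scores)
        out = x * mask[None, :, None, None]  # element-wise product, Eq. (6)
        return out, mask


def joint_loss(logits, labels, mask, lam=0.1):
    """L = L_Acc + lambda * sum_c M_c, Eq. (7), with cross-entropy as L_Acc."""
    return F.cross_entropy(logits, labels) + lam * mask.sum()
```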

III. RESULTS

In the simulations, we trained and evaluated the proposed OSAR on the Kungliga Tekniska högskolan (KTH) dataset.17 KTH contains six types of human actions (walking, jogging, running, boxing, hand waving, and hand clapping) performed several times by 25 subjects in four scenarios. The videos were resized to 171 × 128 pixels and randomly cropped to 112 × 112. In the training phase, a computational method was used to generate the T-DCT spectrum. The video length was set to 16 frames, so the T-DCT spectrum had 16 frequency channels. In the inference phase, DCT-Cam was used to capture the encoded spectrum of the dynamic scene. Notably, the inference data volume was smaller than that of the training data because only the channels selected by the frequency channel selector were acquired; to match the dimension of the training inputs, the remaining channel components were set to zero. To normalize the input channels, we computed the mean and variance of the DCT coefficients for each of the 16 frequency channels separately over all training videos. The whole model, which cascades the frequency channel selector and C2D, was trained jointly. The score of each input frequency channel was initialized to 1, and the parameter λ in the joint loss was set to 0.1. The model was trained with the stochastic gradient descent (SGD) optimizer for 200 epochs, using an initial learning rate of 0.1, a momentum of 0.9, and a weight decay of 4 × 10−5. The learning rate decayed by a factor of 0.1 every 50 epochs.

Figure 4(a) shows the T-DCT spectrum captured by DCT-Cam. We sampled 16 frequencies from the video of the "Playing Basketball" scene. Only the base-frequency component shows static object features, while the other components contain motion information. As shown in Fig. 4(a), the global average spectrum coefficient decreases as the frequency increases, indicating that motion information is concentrated in the low-frequency modes. Moreover, objects with different motions in the scene are separated into different spectra. Specifically, in Fig. 4(b), the slow-moving basketball coach is distributed mainly in the lower, first spectrum, whereas the fast-moving basketball is distributed mainly in the higher, eighth spectrum. Thus, the T-DCT data are a good representation of dynamic scenes.
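As a reference for the training recipe above, the following is a minimal sketch of the optimizer and schedule settings in PyTorch (200 epochs; SGD with learning rate 0.1, momentum 0.9, weight decay 4 × 10−5; decay of 0.1 every 50 epochs; λ = 0.1). The objects `model`, `train_loader`, `channel_mean`, and `channel_std` are placeholders for the cascaded selector + C2D model, the T-DCT data loader, and the per-channel statistics computed from the training videos; they are not part of the authors' released code.

```python
import torch

# Hypothetical cascaded model: model.selector is the frequency channel selector,
# model.backbone is the C2D action recognition network.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=4e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

for epoch in range(200):
    for spectrum, labels in train_loader:  # spectrum: (B, 16, H, W) T-DCT channels
        # per-channel normalisation with mean/std estimated over all training videos
        spectrum = (spectrum - channel_mean[None, :, None, None]) / channel_std[None, :, None, None]
        gated, mask = model.selector(spectrum)
        logits = model.backbone(gated)
        loss = joint_loss(logits, labels, mask, lam=0.1)  # joint loss of Eq. (7)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```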

We compared the proposed OSAR with three competitive temporal methods on the KTH dataset: single-shot long-exposure images, single-shot short-exposure images, and original videos. The long-exposure and short-exposure images serve as two single-shot baselines corresponding to low-speed and high-speed cameras, respectively, and have the same data size as the encoded spectrum captured by DCT-Cam. Long-exposure images contain motion information aliased over a period, but their spatial features may be corrupted. In contrast, short-exposure images capture static object features but lose temporal information. The original video serves as the upper bound in terms of data size and spatiotemporal information. All methods were trained in the same way as described above.

The long-exposure image $I_{\mathrm{le}}$ was generated by

$$I_{\mathrm{le}}=\sum_{t=0}^{n-1}I_{t},\tag{8}$$

where $I_{t}$ is the $t$-th frame of the video and $n$ is the length of the video. The short-exposure image $I_{\mathrm{se}}$ was generated by

$$I_{\mathrm{se}}=\sum_{t=0}^{n-1}\delta\!\left(t,\tfrac{n+1}{2}\right)I_{t},\tag{9}$$

where $\delta(x,y)$ denotes the two-dimensional Dirac delta function.

In Fig. 5, we report the experimental results. As the upper bound of the conventional camera methods, the video method achieved an accuracy of 85.55%. In contrast, the single-shot short-exposure image based on a high-speed camera gave the lowest accuracy of 76.04%. Although the short-exposure method obtains high-resolution static features, the results show that the loss of motion information causes severe degradation. The single-shot long-exposure image based on a low-speed camera performed much better, at 83.98%, which indicates that global motion features are more informative than static object features. Compared with the three temporal methods based on traditional cameras, the proposed OSAR achieves at least 5.21% higher accuracy. Another interesting observation is that the recognition accuracies of the single-shot long-exposure images and the original videos were almost the same, meaning that using more image frames, and hence more data, did not improve performance. This indicates that video data contain substantial temporal redundancy and are not an efficient representation for video understanding tasks, whereas the T-DCT spectrum works well and is a more suitable representation of action information.

Figure 6 shows the results of the proposed frequency channel selector on the validation data. The enabled probability of a frequency channel represents that channel's contribution to the action recognition performance. The results illustrate that the base-frequency channel is the most important and surpasses the other channels. The remaining channels behave differently and largely follow the low-frequency energy-concentration characteristic. Based on the observation that energy is usually concentrated in the low-frequency modes, we discard some high-frequency channels and preserve only the low-frequency channels. The results are shown in Table II. As the number of preserved channels is halved step by step from 16 (to 8, 4, 2, and 1), the recognition accuracy decreases by 0.8%, 4.4%, 1.2%, and 0.1%, respectively. Therefore, in a data-volume-limited system, we can use only the lower-frequency spectrum data and still achieve high performance. Table II gives more details of the experimental results. Notably, preserving only four frequency channels achieved almost the same accuracy as the 16-channel video method (see the boldface row of Table II). Therefore, compared with the traditional camera method, the proposed OSAR can reduce the data requirement by 75% while maintaining the same accuracy on the KTH dataset, which shows the large advantage of OSAR in video understanding tasks.
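A minimal sketch of how the two single-shot baselines of Eqs. (8) and (9) could be generated from a grayscale video array is given below; the function names and the choice of the middle-frame index are illustrative assumptions, not the authors' code.

```python
import numpy as np


def long_exposure(video: np.ndarray) -> np.ndarray:
    """Eq. (8): integrate all n frames into a single long-exposure image."""
    return video.sum(axis=0)  # video: (n, H, W)


def short_exposure(video: np.ndarray) -> np.ndarray:
    """Eq. (9): the delta selects only the middle frame, i.e., a short exposure."""
    n = video.shape[0]
    return video[n // 2]
```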

TABLE II. Action recognition results on the KTH dataset. Boldface denotes the T-DCT method with recognition accuracy comparable to the video method.

| Input | Optical encoder | Exposure | Input data size | Network | No. of parameters | FLOPs | Top-1 (%) | Improvement (%) | Data reduction (%) |
|---|---|---|---|---|---|---|---|---|---|
| Video | | Short | 16 | C2D | 59.99M | 2.02G | 85.55 | | |
| T-DCT (ours) | ✓ | Long | 16 | C2D | 59.99M | 2.02G | 90.76 | 5.21 | 0 |
| T-DCT (ours) | ✓ | Long | 8 | C2D | 59.98M | 1.96G | 89.97 | 4.42 | 50 |
| **T-DCT (ours)** | **✓** | **Long** | **4** | **C2D** | **59.98M** | **1.94G** | **85.54** | **−0.01** | **75** |
| T-DCT (ours) | ✓ | Long | 2 | C2D | 59.98M | 1.92G | 84.37 | −1.18 | 88 |
| T-DCT (ours) | ✓ | Long | 1 | C2D | 59.98M | 1.91G | 84.24 | −1.31 | 94 |
| Image | | Long | 1 | C2D | 59.98M | 1.91G | 83.98 | −1.57 | 94 |
| Image | | Short | 1 | C2D | 59.98M | 1.91G | 76.04 | −9.51 | 94 |

We built a hardware prototype of OSAR for a real-world experiment, as shown in Fig. 7. In the prototype, the camera lens is a CHIOPT HC3505A, the relay lens is a Thorlabs MAP10100100-A, and the DMD is a ViALUX V-9001. The image sensor (FLIR GS3-U3-120S6M-C) is equipped with a zoom lens (Utron VTL0714V). In the experiment, we randomly selected ten labels (100 videos) from the UCF101 dataset.51 The scene video was displayed on a monitor as the input of OSAR. Sixteen DCT basis patterns were loaded onto the DMD as the encoding patterns and switched synchronously with the frame sequence. The exposure time was set to one encoding cycle, so an encoded spectrum was obtained in a single shot. Figure 8(a) shows several captured spectra. These spectra were evaluated by a pre-trained classification model. As a result, 62% of the samples were recognized correctly [see Fig. 8(b)], indicating that OSAR works well in a real hardware system.
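For reference, the 16 temporal DCT basis sequences that are switched on the DMD over one encoding cycle can be tabulated as in the sketch below. This only illustrates the basis itself; how negative cosine weights are realized on a binary DMD (e.g., by complementary patterns or intensity offsets) is hardware-specific and is not addressed here.

```python
import numpy as np

N = 16  # frames per encoding cycle, matching the 16 frequency channels
t = np.arange(N)
k = np.arange(N)
# basis[k, t] = cos(pi/N * (t + 1/2) * k): temporal weight applied to frame t
# when acquiring frequency channel k within a single coded exposure
dct_basis = np.cos(np.pi / N * np.outer(k, t + 0.5))
```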

IV. CONCLUSION

In summary, a joint optical–electronic processing method for action recognition from a single coded-exposure image is proposed and demonstrated. To the best of our knowledge, this is the first work to investigate action recognition from the temporal spectrum. This architecture effectively reduces the detector bandwidth and data size while improving action recognition accuracy. Compared with high-speed cameras, OSAR achieves higher accuracy with lower computation cost and data size. In addition to demonstrating OSAR on standard datasets, we built a prototype coded-exposure camera and captured a series of scene videos to show that the proposed method also works in a real scene. Note that the objective of our work is not to develop a state-of-the-art network architecture for action recognition but to propose an insight and a solution to the data bottleneck; therefore, a model that is widely used in action recognition is preferred. While other models achieve better accuracy on the KTH dataset, we chose this framework for its accessibility and because it is a good representative of a canonical recognition baseline. Although we use C2D for the joint-optimization demonstration in this paper, the structure of the action recognition module is not modified, so it can be flexibly replaced with state-of-the-art action recognition models.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (Grant No. 62135009), the National Key Research and Development Program of China (Grant No. 2019YFB1803500), and Huawei Technologies Co., Ltd.

Conflict of Interest

The authors have no conflicts to disclose.

Author Contributions

Yu Liang: Conceptualization (equal); Data curation (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Validation (equal); Visualization (equal); Writing – original draft (equal); Writing – review & editing (equal). Honghao Huang: Conceptualization (equal); Methodology (equal); Resources (equal); Writing – review & editing (equal). Jingwei Li: Funding acquisition (equal); Project administration (equal); Supervision (equal). Xiaowen Dong: Funding acquisition (equal); Project administration (equal); Supervision (equal). Minghua Chen: Supervision (equal). Sigang Yang: Supervision (equal). Hongwei Chen: Formal analysis (equal); Funding acquisition (equal); Methodology (equal); Project administration (equal); Resources (equal); Supervision (equal); Writing – review & editing (equal).

DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

REFERENCES

1. V. Sharma, M. Gupta, A. Kumar, and D. Mishra, "Video processing using deep learning techniques: A systematic literature review," IEEE Access 9, 139489 (2021). https://doi.org/10.1109/access.2021.3118541
2. N. Gupta, S. K. Gupta, R. K. Pathak, V. Jain, P. Rashidi, and J. S. Suri, "Human activity recognition in artificial intelligence framework: A narrative review," Artif. Intell. Rev. 55, 4755 (2022). https://doi.org/10.1007/s10462-021-10116-x
3. P. Pareek and A. Thakkar, "A survey on video-based human action recognition: Recent updates, datasets, challenges, and applications," Artif. Intell. Rev. 54, 2259–2322 (2021). https://doi.org/10.1007/s10462-020-09904-8
4. A. Sargano, P. Angelov, and Z. Habib, "A comprehensive review on handcrafted and learning-based action representation approaches for human activity recognition," Appl. Sci. 7, 110 (2017). https://doi.org/10.3390/app7010110
5. D. Li, R. Wang, P. Chen, C. Xie, Q. Zhou, and X. Jia, "Visual feature learning on video object and human action detection: A systematic review," Micromachines 13, 72 (2021). https://doi.org/10.3390/mi13010072
6. Y. Cui, L. Yan, Z. Cao, and D. Liu, "TF-blender: Temporal feature blender for video object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision (IEEE, 2021), pp. 8138–8147.
7. M. Gao, F. Zheng, J. J. Yu, C. Shan, G. Ding, and J. Han, "Deep learning for video object segmentation: A review," Artif. Intell. Rev. 55, 1–75 (2022). https://doi.org/10.1007/s10462-022-10176-7
8. H. K. Cheng, Y.-W. Tai, and C.-K. Tang, "Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2021), pp. 5559–5568.
9. G. Ciaparrone, F. Luque Sánchez, S. Tabik, L. Troiano, R. Tagliaferri, and F. Herrera, "Deep learning in video multi-object tracking: A survey," Neurocomputing 381, 61–88 (2020). https://doi.org/10.1016/j.neucom.2019.11.023
10. Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu, "FairMOT: On the fairness of detection and re-identification in multiple object tracking," Int. J. Comput. Vision 129, 3069–3087 (2021). https://doi.org/10.1007/s11263-021-01513-4
11. Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar, "Rethinking the faster R-CNN architecture for temporal action localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 1130–1139.
12. C. Lin, C. Xu, D. Luo, Y. Wang, Y. Tai, C. Wang, J. Li, F. Huang, and Y. Fu, "Learning salient boundary feature for anchor-free temporal action localization," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2021), pp. 3320–3329.
13. S. Yan, X. Xiong, A. Arnab, Z. Lu, M. Zhang, C. Sun, and C. Schmid, "Multiview transformers for video recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2022), pp. 3333–3343.
14. J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 6299–6308.
15. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2014), pp. 1725–1732.
16. S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell. 35, 221–231 (2012). https://doi.org/10.1109/TPAMI.2012.59
17. C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004 (IEEE, 2004), Vol. 3, pp. 32–36.
18. U. Demir, Y. S. Rawat, and M. Shah, "TinyVIRAT: Low-resolution video action recognition," in 2020 25th International Conference on Pattern Recognition (ICPR) (IEEE, 2021), pp. 7387–7394.
19. M. Xu, A. Sharghi, X. Chen, and D. J. Crandall, "Fully-coupled two-stream spatiotemporal networks for extremely low resolution action recognition," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) (IEEE, 2018), pp. 1607–1615.
20. A. Srivastava, O. Dutta, J. Gupta, S. Agarwal, and P. AP, "A variational information bottleneck based method to compress sequential networks for human action recognition," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (IEEE, 2021), pp. 2745–2754.
21. Y. Huo, X. Xu, Y. Lu, Y. Niu, M. Ding, Z. Lu, T. Xiang, and J.-r. Wen, "Lightweight action recognition in compressed videos," in European Conference on Computer Vision (Springer, 2020), pp. 337–352.
22. Z. Zhang, Z. Kuang, P. Luo, L. Feng, and W. Zhang, "Temporal sequence distillation: Towards few-frame action recognition in videos," in Proceedings of the 26th ACM International Conference on Multimedia (ACM, 2018), pp. 257–264.
23. Y. Meng, C.-C. Lin, R. Panda, P. Sattigeri, L. Karlinsky, A. Oliva, K. Saenko, and R. Feris, "AR-Net: Adaptive frame resolution for efficient action recognition," in European Conference on Computer Vision (Springer, 2020), pp. 86–104.
24. W. Wu, D. He, X. Tan, S. Chen, and S. Wen, "Multi-agent reinforcement learning based frame sampling for effective untrimmed video recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision (IEEE, 2019), pp. 6222–6231.
25. S. N. Gowda, M. Rohrbach, and L. Sevilla-Lara, "Smart frame selection for action recognition," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI Press, 2021), Vol. 35, Issue 2, pp. 1451–1459.
26. R. G. Baraniuk, T. Goldstein, A. C. Sankaranarayanan, C. Studer, A. Veeraraghavan, and M. B. Wakin, "Compressive video sensing: Algorithms, architectures, and applications," IEEE Signal Process. Mag. 34, 52–66 (2017). https://doi.org/10.1109/msp.2016.2602099
27. E. J. Candès and M. B. Wakin, "An introduction to compressive sampling," IEEE Signal Process. Mag. 25, 21–30 (2008). https://doi.org/10.1109/msp.2007.914731
28. C. Hu, H. Huang, M. Chen, S. Yang, and H. Chen, "Video object detection from one single image through opto-electronic neural network," APL Photonics 6, 046104 (2021). https://doi.org/10.1063/5.0040424
29. T. Okawara, M. Yoshida, H. Nagahara, and Y. Yagi, "Action recognition from a single coded image," in 2020 IEEE International Conference on Computational Photography (ICCP) (IEEE, 2020), pp. 1–11.
30. C. Hu, H. Huang, M. Chen, S. Yang, and H. Chen, "FourierCam: A camera for video spectrum acquisition in a single shot," Photonics Res. 9, 701–713 (2021). https://doi.org/10.1364/prj.412491
31. Z. Zhang, X. Wang, G. Zheng, and J. Zhong, "Fast Fourier single-pixel imaging via binary illumination," Sci. Rep. 7, 12029 (2017). https://doi.org/10.1038/s41598-017-12228-3
32. N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Trans. Comput. C-23, 90–93 (1974). https://doi.org/10.1109/t-c.1974.223784
33. K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications (Academic Press, 2014).
34. M. Barbero, H. Hofmann, and N. Wells, "DCT source coding and current implementations for HDTV," EBU Tech. Rev. 251, 22–33 (1992).
35. W. Lea, Video on Demand (House of Commons Library, 1994).
36. K. Xu, M. Qin, F. Sun, Y. Wang, Y.-K. Chen, and F. Ren, "Learning in the frequency domain," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2020), pp. 1740–1749.
37. L. Jiang, B. Dai, W. Wu, and C. C. Loy, "Focal frequency loss for image reconstruction and synthesis," in Proceedings of the IEEE/CVF International Conference on Computer Vision (IEEE, 2021), pp. 13919–13929.
38. J. F. Blinn, "What's that deal with the DCT?," IEEE Comput. Graphics Appl. 13, 78–83 (1993). https://doi.org/10.1109/38.219457
