Viewpoint planning with transition management for active object recognition

Visual object recognition has a wide range of applications, e.g., automatic driving (Behl et al., 2017), robotics (Stria and Hlavác, 2018), medical diagnosis (Duan et al., 2019), and environmental perception (Roynard et al., 2018). Most recognition systems merely take a single-viewpoint image as input and produce a category label estimate as output (Jayaraman and Grauman, 2019). Such systems are prone to recognition errors when the image cannot provide sufficient information. In contrast, human visual behavior is an active process that serves to perceive the surroundings more clearly. As shown in Figure 1, in daily life, people intelligently observe an object from different viewpoints to determine its identity. Similarly, if the viewpoint of an agent can be adjusted (e.g., mobile robots and autonomous vehicles), more valuable information can be obtained to boost recognition performance.

As a branch of active vision (Parr et al., 2021), active object recognition (AOR) (Patten et al., 2015; Wu et al., 2015; Potthast et al., 2016; Van de Maele et al., 2022) is a typical technology for realizing the above idea: it aims to collect additional clues by purposefully changing the viewpoint of an agent to improve the quality of recognition. Andreopoulos and Tsotsos (2013) and Zeng et al. (2020) review a series of classical AOR methods. One of the central problems in AOR is viewpoint planning (VP), which refers to developing a policy that determines the next viewpoints of the agent. In recent years, researchers have mainly focused on using reinforcement learning to solve the VP problem (Becerra et al., 2014; Malmir et al., 2015; Malmir and Cottrell, 2017; Liu et al., 2018a), i.e., on using the viewpoint transitions explored by the agent to train the VP policy. Becerra et al. (2014) formally define object recognition as a partially observable Markov decision process problem and use stochastic dynamic programming to address it. As a pioneering work, Malmir et al. (2015) provide a public AOR dataset called GERMS that includes 136 objects imaged from different views and develop a deep Q-learning (DQL) system that learns to actively verify objects using standard back-propagation and Q-learning. In the same vein, Liu et al. (2018a) design a hierarchical local-receptive-field architecture to predict object labels and learn a VP policy by combining extreme learning machine and Q-learning. Similar to Becerra et al. (2014), Malmir and Cottrell (2017) also model AOR as a partially observable Markov decision process; the difference is that a belief tree search is built to find near-optimal action values corresponding to the next best viewpoints. These VP methods explore a discrete viewpoint space, which may introduce significant quantization errors. Hence, Liu et al. (2018b) present a continuous VP method based on trust region policy optimization (TRPO) (Schulman et al., 2015) and adopt extreme learning machine (Huang et al., 2006) to reduce computational complexity. It shows promising results on the GERMS dataset compared to the discrete VP methods. However, due to the on-policy characteristic of TRPO, explored viewpoint transitions are discarded after they are used for training, which leads to inefficient use of the explored transitions.

Deterministic policy gradient theory (Silver et al., 2014) was proposed for reinforcement learning with continuous actions and introduced an off-policy actor-critic algorithm (OPDAC-Q) to learn a deterministic target policy. Lillicrap et al. (2015) present a deep deterministic policy gradient (DDPG) approach that combines the deterministic policy gradient with DQN (Mnih et al., 2013, 2015) to learn policies in high-dimensional continuous action spaces. Fujimoto et al. (2018) contribute a mechanism that takes the minimum value between a pair of critics in the actor-critic algorithm of Silver et al. (2014) to tackle function approximation errors. Deterministic policy gradient theory has been widely applied in various fields, such as electricity markets (Liang et al., 2020), vehicle speed tracking control (Hao et al., 2021), fuzzy PID controllers (Shi et al., 2020), quadrotor control (Wang et al., 2020), energy efficiency (Zhang et al., 2020), and autonomous underwater vehicles (Sun et al., 2020; Wu et al., 2022). However, to the best of our knowledge, it has never been employed in the AOR task.

In this work, we present a novel continuous VP method with transition management based on reinforcement learning. This method can efficiently use the explored viewpoint transitions to learn the continuous VP policy. Concretely, a learning framework for the continuous VP policy is established using the deterministic policy gradient theory, whose off-policy characteristic provides the opportunity to reuse explored transitions. We then design a scheme of viewpoint transition management that stores the explored transitions and decides which transitions are used for policy learning. The scheme is implemented by introducing and improving the prioritized experience replay technique (Schaul et al., 2016). The improvements are: (1) we estimate the temporal difference (TD) error with the clipped double Q-learning algorithm (Fujimoto et al., 2018) so as to adapt it to our continuous VP framework, and (2) we utilize importance sampling to correct the bias introduced by prioritized replay. Finally, within the framework, we develop an algorithm based on twin delayed deep deterministic policy gradient (TD3) (Fujimoto et al., 2018) and the designed scheme to train the continuous VP policy. Experimental results on the public dataset GERMS demonstrate the effectiveness of the proposed VP method.
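To make the transition-management idea concrete, the following minimal Python sketch shows a proportional prioritized replay buffer for viewpoint transitions, including the importance-sampling weights used for bias correction. Class and parameter names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class ViewpointTransitionBuffer:
    """Minimal proportional prioritized replay buffer (illustrative sketch).

    Stores viewpoint transitions (s, a, r, s_next) with priorities derived
    from TD errors and returns importance-sampling weights to correct the
    bias introduced by non-uniform sampling (Schaul et al., 2016).
    """

    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha      # how strongly priorities shape sampling
        self.beta = beta        # strength of importance-sampling correction
        self.eps = eps          # keeps priorities strictly positive
        self.storage = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0

    def add(self, transition):
        # New transitions get the current maximum priority so they are
        # replayed at least once before their TD error is known.
        max_prio = self.priorities.max() if self.storage else 1.0
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        prios = self.priorities[:len(self.storage)]
        probs = prios ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.storage), batch_size, p=probs)
        # Importance-sampling weights correct the bias of prioritized sampling.
        weights = (len(self.storage) * probs[idx]) ** (-self.beta)
        weights /= weights.max()
        batch = [self.storage[i] for i in idx]
        return batch, idx, weights

    def update_priorities(self, idx, td_errors):
        # Priority is the absolute TD error of the clipped double-Q target.
        self.priorities[idx] = np.abs(td_errors) + self.eps
```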

The main contributions of this work are summarized as follows:

• A novel continuous VP method with transition management for AOR is presented to address the inefficient use of explored viewpoint transitions in the existing continuous VP method.

• We establish a learning framework of the continuous VP policy via the deterministic policy gradient theory.

• A scheme of viewpoint transition management is designed, which is implemented by introducing and improving the prioritized experience replay technique.

• We develop an algorithm based on twin delayed deep deterministic policy gradient and the designed scheme to train the continuous VP policy.

The rest of this paper is structured as follows: Section 2 formulates the VP problem. Section 3 details the proposed framework for solving the problem. Finally, the implementation and experimental results, as well as the conclusions, are provided in Sections 4 and 5.

An AOR system mounted on an autonomous mobile agent allows the agent to identify an object by processing images captured from different viewpoints. Suppose that at the initial time t = 0, an object to be identified is given from an object library containing M objects and the agent captures an image I_{Φ_0} from the initial viewpoint Φ_0. The classifier C(·) in the AOR system gives a probability prediction C(I_{Φ_0}) of the object according to the image I_{Φ_0}. C(I_{Φ_0}) is an M-dimensional vector whose elements denote the recognition probabilities of the different objects in the library. When the prediction is uncertain [i.e., the maximum probability in C(I_{Φ_0}) is less than a preset threshold], the agent moves to explore more viewpoints to improve recognition performance. This requires the system to plan a relative movement action a_t for the agent to obtain a new viewpoint Φ_{t+1} = Φ_t + a_t. The new image I_{Φ_{t+1}} captured from the viewpoint Φ_{t+1} is then used for recognition again. This process is repeated until a stop condition is reached (e.g., planning up to T_max time steps or reaching the preset probability threshold).
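A minimal Python sketch of this recognition loop is given below. Here `classifier`, `policy`, and `capture_image` are hypothetical placeholders, and the belief-fusion step stands in for the update detailed in Section 3.2.

```python
import numpy as np

def active_recognition(classifier, policy, capture_image, phi0,
                       prob_threshold=0.9, T_max=10):
    """Sketch of the AOR loop: classify, and if uncertain, move and look again."""
    phi = phi0
    image = capture_image(phi)               # image I_Phi captured at viewpoint Phi
    belief = classifier(image)               # M-dimensional probability vector C(I_Phi)
    for t in range(T_max):
        if belief.max() >= prob_threshold:   # confident enough: stop planning
            break
        a = policy(belief)                   # relative movement action a_t = u(s_t)
        phi = phi + a                        # new viewpoint Phi_{t+1} = Phi_t + a_t
        image = capture_image(phi)
        # Fuse the new single-view prediction into the running belief
        # (placeholder for the naive Bayes update of Section 3.2).
        belief = belief * classifier(image)
        belief = belief / belief.sum()
    return int(np.argmax(belief)), belief
```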

An undesirable planning action may make it difficult for the agent to capture images that are useful for recognition. Therefore, we need to find an effective VP policy for the AOR system. For this purpose, the VP problem is cast as a reinforcement learning problem and formulated as a Markov decision process, described by the six-element tuple ⟨S, A, r, P, γ, u⟩.

• S represents a set of continuous states in which each state s is produced by the predictions of corresponding images captured from different viewpoints.

• A is a set of continuous actions which are determined by the agent. Each action a in the set is used for the agent to get a new viewpoint.

• r:S×A → ℝ is a reward function designed to evaluate the quality of selecting a viewpoint.

• P:S×A×S→[0,1] denotes the transition probability. It describes the probability of transitioning to the subsequent state s′ after the action a is selected in the state s.

• γ ∈ [0, 1] is a discount factor used to adjust the attention between present and future rewards.

• u:S→A is a deterministic continuous VP policy [i.e., a = u(s)] that can generate an action for the agent to get a new viewpoint in a certain state.

The VP problem is thus transformed into solving for the optimal policy u* in the reinforcement learning setting.

In reinforcement learning, the optimal policy u* can be achieved by maximizing the expected return over all episodes. At any time step t of an episode, given a state s_t ∈ S, the agent plans an action a_t ∈ A according to its current policy u (a_t = u(s_t)), receiving a reward r(s_t, a_t) and the new state s_{t+1} ~ P(s_{t+1} | s_t, a_t). (The tuple (s_t, a_t, r_t, s_{t+1}) is called a viewpoint transition in the AOR task.) The return is defined as the cumulative discounted reward ∑_{i=t}^{T} γ^{i−t} r(s_i, a_i), where T is the final time step of planning. Let Q^u(s_t, a_t) be the expected return when performing action a_t in state s_t under the policy u. Q^u(s_t, a_t) is defined as

Q^u(s_t, a_t) = E_u[ ∑_{i=t}^{T} γ^{i−t} r(s_i, a_i) | s_t, a_t ],     (1)

which is known as the action value function. u* can be solved by maximizing the expected value of Equation (1) over the whole state space,

u* = argmax_u E_{s~d(s)}[ Q^u(s, u(s)) ],     (2)

where d(·) is the state probability density of the Markov decision process under its stationary distribution (Bellemare et al., 2017).

We assume the deterministic continuous VP policy u is parameterized by θ and denote it as u(s; θ). Naturally, Equation (2) can be transformed into an optimization with respect to θ that maximizes the objective

J(θ) = E_{s~d(s)}[ Q^u(s, u(s; θ)) ].     (3)

To solve the optimization of Equation (3), the deterministic policy gradient theory (Silver et al., 2014) is introduced to iteratively update the parameters θ by taking the gradient of Equation (3),

∇_θ J(θ) = E_{s~d(s)}[ ∇_θ u(s; θ) ∇_a Q^u(s, a) |_{a=u(s;θ)} ].     (4)

We utilize Equation (4) as a framework to learn the optimal deterministic continuous VP policy u(s_t; θ*) for AOR. The reason this framework can reuse the explored viewpoint transitions is the off-policy characteristic of the deterministic policy gradient theory: viewpoint transitions explored by any policy can be used to compute the gradient in Equation (4), because the gradient depends only on the distribution of the state s_t (Silver et al., 2014). The pipeline of our AOR is shown in Figure 2, where the VP policy u(s_t; θ) is represented by a three-layer fully-connected neural network with parameters θ. The policy network takes a state s_t as input and outputs a deterministic action a_t = u(s_t; θ). In the following, the representations of the state s_t and the reward function r(s_t, a_t) are elaborated. Additionally, we design a scheme of viewpoint transition management and develop a training algorithm based on twin delayed deep deterministic policy gradient (TD3) (Fujimoto et al., 2018) and the scheme to learn u(s_t; θ*) within the framework.
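For illustration, the PyTorch-style sketch below shows one training step under this framework, combining the clipped double-Q target of TD3 with prioritized sampling and importance-sampling weights (using a buffer like the one sketched in the Introduction). The network handles, optimizers, hyper-parameters, action clamping, and the omission of terminal-state masking are simplifying assumptions rather than the authors' exact implementation.

```python
import numpy as np
import torch

def train_step(actor, actor_target, critics, critic_targets, actor_opt, critic_opt,
               buffer, step, batch_size=64, gamma=0.99, policy_noise=0.2,
               noise_clip=0.5, policy_delay=2, tau=0.005, max_action=1.0):
    """One TD3-style update using prioritized viewpoint transitions."""
    batch, idx, weights = buffer.sample(batch_size)
    s, a, r, s_next = (torch.as_tensor(np.array(x), dtype=torch.float32)
                       for x in zip(*batch))
    w = torch.as_tensor(weights, dtype=torch.float32).unsqueeze(-1)

    with torch.no_grad():
        # Target policy smoothing: perturb the target action with clipped noise.
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a_next = (actor_target(s_next) + noise).clamp(-max_action, max_action)
        # Clipped double Q-learning: the target uses the smaller of the two critics.
        q_next = torch.min(critic_targets[0](s_next, a_next),
                           critic_targets[1](s_next, a_next))
        y = r.unsqueeze(-1) + gamma * q_next   # terminal-state masking omitted for brevity

    q1, q2 = critics[0](s, a), critics[1](s, a)
    td_error = (y - q1).detach()               # TD error of the clipped double-Q target
    # Importance-sampling weights correct the bias of prioritized sampling.
    critic_loss = (w * ((q1 - y).pow(2) + (q2 - y).pow(2))).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Refresh priorities with the new absolute TD errors.
    buffer.update_priorities(idx, td_error.abs().squeeze(-1).cpu().numpy())

    if step % policy_delay == 0:
        # Deterministic policy gradient: move actions toward higher Q-values.
        actor_loss = -critics[0](s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        # Soft (Polyak) updates of the target networks.
        for net, tgt in [(actor, actor_target), (critics[0], critic_targets[0]),
                         (critics[1], critic_targets[1])]:
            for p, tp in zip(net.parameters(), tgt.parameters()):
                tp.data.mul_(1.0 - tau).add_(tau * p.data)
```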

As shown in Figure 2, we first use a convolutional neural network (CNN) model to extract features from the captured image I_{Φ_t} and then recognize the concerned objects with a softmax layer added on top of the CNN model. The CNN model and the softmax layer constitute a classifier C(·), which is pre-trained with images from different viewpoints of the concerned objects. The parameters of the classifier are fixed when training the VP policy network. The classifier outputs a belief vector C(I_{Φ_t}) whose elements denote the recognition probabilities of the different objects. The o-th element of the vector is denoted P(o | I_{Φ_t}), where o = 1, 2, ..., M is the object label. The recognition state s_t is the posterior probability distribution over the objects at time step t, produced from the captured images. It is also expressed as a vector whose o-th element is P(o | I_{Φ_0}, I_{Φ_1}, ..., I_{Φ_t}), o = 1, 2, ..., M. According to naive Bayes (Paletta and Pinz, 2000), P(o | I_{Φ_0}, I_{Φ_1}, ..., I_{Φ_t}) is given as

P(o | I_{Φ_0}, I_{Φ_1}, ..., I_{Φ_t}) = ξ_t P(o | I_{Φ_t}) P(o | I_{Φ_0}, I_{Φ_1}, ..., I_{Φ_{t−1}}),

where ξ_t is a normalizing coefficient.
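As a concrete illustration, a minimal sketch of this recursive belief update follows, assuming the single-view predictions are fused multiplicatively and renormalized; the toy usage with random predictions is purely illustrative.

```python
import numpy as np

def update_belief(prev_belief, single_view_probs, eps=1e-12):
    """Recursive naive Bayes fusion: fold the new single-view prediction C(I_Phi_t)
    into the running posterior P(o | I_Phi_0, ..., I_Phi_t)."""
    fused = prev_belief * single_view_probs   # element-wise product over the M objects
    fused = np.maximum(fused, eps)            # guard against an all-zero belief
    return fused / fused.sum()                # normalization plays the role of xi_t

# Toy usage with random single-view predictions for M = 136 objects (as in GERMS).
M = 136
rng = np.random.default_rng(0)
belief = np.full(M, 1.0 / M)                  # uniform prior before any observation
for _ in range(3):                            # three captured viewpoints
    probs = rng.dirichlet(np.ones(M))         # stand-in for a softmax output C(I_Phi_t)
    belief = update_belief(belief, probs)
```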

Reward function r(s_t, a_t) (denoted as r_t for simplicity) is used to evaluate the quality of selecting a viewpoint. As described in Section 3.2, the state is a posterior probability distribution over the objects. The flatter the distribution, the greater the recognition uncertainty. To quantify this uncertainty, information entropy (Zhao et al., 2016; Liu et al., 2018b) is utilized, and the uncertainty in state s_t is denoted as H(s_t) = −∑_o P(o | I_{Φ_0}, I_{Φ_1}, ..., I_{Φ_t}) log P(o | I_{Φ_0}, I_{Φ_1}, ..., I_{Φ_t}). The purpose of AOR is to reduce the uncertainty of recognition through viewpoint planning. Therefore, we design the reward function according to the change in uncertainty before and after viewpoint selection. The resulting reward function is

r_t = H(s_t) − H(s_{t+1}).
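For illustration, the short sketch below computes this entropy-based reward from two successive belief vectors, under the assumption that the reward is exactly the decrease in entropy, with no additional terms.

```python
import numpy as np

def entropy(belief, eps=1e-12):
    """Shannon entropy H(s) of the posterior belief; a flatter belief means higher uncertainty."""
    p = np.clip(belief, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def reward(belief_t, belief_t1):
    """Reward as the reduction in recognition uncertainty after moving to the new viewpoint."""
    return entropy(belief_t) - entropy(belief_t1)
```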

4.3.1.4. Purposeful continuous VP policy

The TRPO policy (Liu et al., 2018b) utilizes trust region policy optimization (Schulman et al., 2015) to learn a continuous VP policy and adopts extreme learning machine (Huang et al., 2006) to reduce computational complexity. This policy is on-policy, which means the agent cannot reuse explored viewpoint transitions for efficient training.

Since the main focus of this work is viewpoint planning, we do not investigate the impact of classifiers on recognition performance. Therefore, for a fair comparison, the classifiers in the different approaches are the same in the experiment. Figure 5 reports the experimental results of our method against other approaches over 10 random seeds of the policy network initialization. Some observations from Figure 5 are presented as follows: (1) Viewpoint planning can greatly improve recognition performance. A VP count of 0 means the agent recognizes the concerned object from a single viewpoint. Clearly, the recognition accuracy of the single-viewpoint policy is far lower than that of the methods that perform multi-viewpoint recognition via VP. This is because VP uncovers more distinctive object information, which reduces recognition uncertainty and thus improves recognition performance. As shown in Figure 6, the uncertainty of recognition decreases as the number of viewpoints increases. Figure 7 shows the process of actively identifying an object. (2) The performance of the blind VP policies is nowhere near as good as that of the purposeful VP policies. The primary reason is that the purposeful VP policies (i.e., the DQL policy, the TRPO policy, and our policy) can purposefully plan the next viewpoints according to the observed information. (3) The continuous VP policies perform better than the discrete VP policy. This is because the continuous VP policies (i.e., the TRPO policy and our policy) directly explore the continuous viewpoint space without sampling, so they do not miss important viewpoints. (4) The performance of our deterministic continuous VP policy exceeds that of the TRPO policy. This is mainly because our scheme of viewpoint transition management reuses the obtained viewpoint transitions to improve training.

Figure 5. Performance comparison between our presented deterministic continuous VP approach and several competing methods. The shaded region represents the standard deviation of the average evaluation over 10 trials.

Figure 6. The average entropy over the whole test dataset. The experiment is implemented with our VP model.

Figure 7. An example of actively identifying an object with our VP method. The recognition belief increases as the number of viewpoint planning steps grows.

4.3.2. Ablation studies

To verify the importance of different components in our proposed VP model, we conduct ablation experiments on the two components, i.e., viewpoint transition management (VTM) and bias correction (BC). The models trained without VTM and without BC are denoted as Ours-woVTM and Ours-woBC, respectively. From the results over 10 random seeds presented in Figure 8, we can notice that: (1) The performance of Ours-woVTM is the worst, which illustrates that our designed scheme of viewpoint transition management indeed enhances the training effect. (2) The performance of Ours-woBC is inferior to that of Ours, especially when the capacity K of the viewpoint transition buffer is large. This is because, with a larger capacity, the distribution of s_{t+1} in the buffer is closer to its true distribution; in this case, the effect of our importance-sampling-based bias correction is more obvious.

Figure 8. The performance comparison results of ablation experiments. K represents the capacity of the viewpoint transition buffer. The shaded region represents the standard deviation of the average evaluation over 10 trials.

4.3.3. Investigation of sampling strategies

To verify the superiority of our proposed sampling strategy (i.e., prioritized experience replay based on clipped double Q-learning and bias correction) in the scheme of viewpoint transition management, we conduct comparison experiments against the uniform sampling strategy (Lin, 1992) over 10 random seeds. As shown in Figure 9, our sampling strategy achieves better performance, since the uniform sampling strategy ignores the differing importance of individual viewpoint transitions.

Figure 9. Performance comparison between our sampling strategy and the uniform sampling strategy. The capacity of the viewpoint transition buffer is 10^6. The shaded region represents the standard deviation of the average evaluation over 10 trials.

5. Conclusions

In this paper, a continuous viewpoint planning method with transition management is proposed for active object recognition based on reinforcement learning. Specifically, we employ the deterministic policy gradient theory to build a learning framework for the viewpoint planning policy. We also design a scheme of viewpoint transition management that can store and reuse the obtained transitions. We then develop an algorithm based on twin delayed deep deterministic policy gradient and the designed scheme to train the policy. Experiments on a public dataset demonstrate the effectiveness of our method. In the future, we will integrate calibrated probabilistic classifiers into AOR research. As stated in Popordanoska et al. (2022), the way the posterior probability distribution is defined in our work assumes that the classifier is properly calibrated, i.e., that the softmax outputs reflect the true class probabilities. In general, this is not necessarily the case.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Author contributions

HS and YL: conceptualization. FZ and HS: methodology. HS and YK: software. HS and SF: investigation. FZ: resources and funding acquisition. YL: data curation. HS: writing—original draft. YL and PZ: writing—review and editing. JW and YW: supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant No. U1713216.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Andreopoulos, A., and Tsotsos, J. K. (2013). 50 years of object recognition: directions forward. Comput. Vis. Image Understand. 117, 827–891. doi: 10.1016/j.cviu.2013.04.005

Becerra, I., Valentin-Coronado, L. M., Murrieta-Cid, R., and Latombe, J.-C. (2014). “Appearance-based motion strategies for object detection,” in 2014 IEEE International Conference on Robotics and Automation (ICRA) (Hong Kong: IEEE), 6455–6461.

Behl, A., Hosseini Jafari, O., Karthik Mustikovela, S., Abu Alhaija, H., Rother, C., and Geiger, A. (2017). “Bounding boxes, segmentations and object coordinates: how important is recognition for 3d scene flow estimation in autonomous driving scenarios,” in Proceedings of the IEEE International Conference on Computer Vision (Venice: IEEE), 2574–2583.

Bellemare, M. G., Dabney, W., and Munos, R. (2017). “A distributional perspective on reinforcement learning,” in International Conference on Machine Learning (PMLR), 449–458. doi: 10.48550/arXiv.1707.06887

Duan, J., Bello, G., Schlemper, J., Bai, W., Dawes, T., Biffi, C., et al. (2019). Automatic 3D bi-ventricular segmentation of cardiac images by a shape-refined multi-task deep learning approach. IEEE Trans. Med. Imaging 38, 2151–2164. doi: 10.1109/TMI.2019.2894322

Fujimoto, S., Hoof, H., and Meger, D. (2018). “Addressing function approximation error in actor-critic methods,” in International Conference on Machine Learning (PMLR), 1587–1596. doi: 10.48550/arXiv.1802.09477

Hao, G., Fu, Z., Feng, X., Gong, Z., Chen, P., Wang, D., et al. (2021). A deep deterministic policy gradient approach for vehicle speed tracking control with a robotic driver. IEEE Trans. Autom. Sci. Eng. 19, 2514–2525. doi: 10.1109/TASE.2021.3088004

Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. (2006). Extreme learning machine: theory and applications. Neurocomputing 70, 489–501. doi: 10.1016/j.neucom.2005.12.126

Kingma, D. P., and Ba, J. (2014). ADAM: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. doi: 10.48550/arXiv.1412.6980

Liang, Y., Guo, C., Ding, Z., and Hua, H. (2020). Agent-based modeling in electricity market using deep deterministic policy gradient algorithm. IEEE Trans. Power Syst. 35, 4180–4192. doi: 10.1109/TPWRS.2020.2999536

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., et al. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Lin, L.-J. (1992). Reinforcement Learning for Robots Using Neural Networks. Pittsburgh, PA: Carnegie Mellon University.

Liu, H., Li, F., Xu, X., and Sun, F. (2018a). Active object recognition using hierarchical local-receptive-field-based extreme learning machine. Memet. Comput. 10, 233–241. doi: 10.1007/s12293-017-0229-2

Liu, H., Wu, Y., and Sun, F. (2018b). Extreme trust region policy optimization for active object recognition. IEEE Trans. Neural Network Learn. Syst. 29, 2253–2258. doi: 10.1109/TNNLS.2017.2785233

Malmir, M., and Cottrell, G. W. (2017). “Belief tree search for active object recognition,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE), 4276–4283.

Malmir, M., Sikka, K., Forster, D., Movellan, J. R., and Cottrell, G. (2015). “Deep q-learning for active recognition of germs: Baseline performance on a standardized dataset for active learning,” in Proceedings of the British Machine Vision Conference (BMVC), 161.1–161.11. doi: 10.5244/C.29.161

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., et al. (2013). Playing atari with deep reinforcement learning. arXiv preprint. doi: 10.48550/arXiv.1312.5602

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature 518, 529–533. doi: 10.1038/nature14236

Paletta, L., and Pinz, A. (2000). Active object recognition by view integration and reinforcement learning. Robot. Auton. Syst. 31, 71–86. doi: 10.1016/S0921-8890(99)00079-2

Parr, T., Sajid, N., Da Costa, L., Mirza, M. B., and Friston, K. J. (2021). Generative models for active vision. Front. Neurorobot. 15, 651432. doi: 10.3389/fnbot.2021.651432

Patten, T., Zillich, M., Fitch, R., Vincze, M., and Sukkarieh, S. (2015). Viewpoint evaluation for online 3-d active object classification. IEEE Robot. Autom. Lett. 1, 73–81. doi: 10.1109/LRA.2015.2506901

Popordanoska, T., Sayer, R., and Blaschko, M. B. (2022). A consistent and differentiable lp canonical calibration error estimator. arXiv preprint. doi: 10.48550/arXiv.2210.07810

Potthast, C., Breitenmoser, A., Sha, F., and Sukhatme, G. S. (2016). Active multi-view object recognition: a unifying view on online feature selection and view planning. Robot Auton. Syst. 84, 31–47. doi: 10.1016/j.robot.2016.06.013

Roynard, X., Deschaud, J.-E., and Goulette, F. (2018). “Paris-lille-3d: a point cloud dataset for urban scene segmentation and classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops(CVPR) (Salt Lake City, UT: IEEE), 2027–2030.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). “Prioritized experience replay,” in Proceedings of the International Conference on Learning Representations (ICLR).

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). “Trust region policy optimization,” in International Conference on Machine Learning (PMLR), 1889–1897. doi: 10.48550/arXiv.1502.05477

Shi, Q., Lam, H.-K., Xuan, C., and Chen, M. (2020). Adaptive neuro-fuzzy pid controller based on twin delayed deep deterministic policy gradient algorithm. Neurocomputing 402, 183–194. doi: 10.1016/j.neucom.2020.03.063

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). “Deterministic policy gradient algorithms,” in International Conference on Machine Learning (PMLR), 387–395.

Stria, J., and Hlavác, V. (2018). “Classification of hanging garments using learned features extracted from 3d point clouds,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (Madrid: IEEE), 5307–5312.

Sun, Y., Ran, X., Zhang, G., Wang, X., and Xu, H. (2020). Auv path following controlled by modified deep deterministic policy gradient. Ocean Eng. 210, 107360. doi: 10.1016/j.oceaneng.2020.107360

Sutton, R. S., and Barto, A. G. (2018). Reinforcement Learning: An Introduction. Cambridge: MIT Press.

Van de Maele, T., Verbelen, T., Çatal, O., and Dhoedt, B. (2022). Embodied object representation learning and recognition. Front. Neurorobot. 16, 840658. doi: 10.3389/fnbot.2022.840658

Wang, Y., Sun, J., He, H., and Sun, C. (2020). Deterministic policy gradient with integral compensator for robust quadrotor control. IEEE Trans. Syst. Man Cybernet. Syst. 50, 3713–3725. doi: 10.1109/TSMC.2018.2884725

Wu, J., Yang, Z., Liao, L., He, N., Wang, Z., and Wang, C. (2022). A state-compensated deep deterministic policy gradient algorithm for uav trajectory tracking. Machines 10, 496. doi: 10.3390/machines10070496

Wu, K., Ranasinghe, R., and Dissanayake, G. (2015). “Active recognition and pose estimation of household objects in clutter,” in 2015 IEEE International Conference on Robotics and Automation (ICRA) (Seattle, WA: IEEE), 4230–4237.

Zeng, R., Wen, Y., Zhao, W., and Liu, Y.-J. (2020). View planning in robot active vision: a survey of systems, algorithms, and applications. Comput. Vis. Media 6, 225–245. doi: 10.1007/s41095-020-0179-3

Zhang, T., Zhu, K., and Wang, J. (2020). Energy-efficient mode selection and resource allocation for d2d-enabled heterogeneous networks: a deep reinforcement learning approach. IEEE T. Wirel. Commun. 20, 1175–1187. doi: 10.1109/TWC.2020.3031436

Zhao, D., Chen, Y., and Lv, L. (2016). Deep reinforcement learning with visual attention for vehicle classification. IEEE Tran. Cogn. Dev. Syst. 9, 356–367. doi: 10.1109/TCDS.2016.2614675
