ValueError: mismatched dimensions in reward arrays during DAgger imitation learning


I’m implementing imitation learning using the DAgger algorithm from the imitation library in Python. The environment I’m working with is a custom Gym environment that simulates a shallow lake management problem. The expert policy is generated from an optimization process, and I’m trying to train a learner policy using the DAgger framework.
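
For context, the training setup looks roughly like the sketch below. This is a simplified outline rather than my exact script: the expert policy object, the scratch directory, and the timestep budget are placeholders, and I am assuming a version of the imitation library that provides bc.BC and SimpleDAggerTrainer.

import numpy as np
import gymnasium as gym  # or `import gym`, depending on which API the environment targets
from stable_baselines3.common.vec_env import DummyVecEnv
from imitation.algorithms import bc
from imitation.algorithms.dagger import SimpleDAggerTrainer

rng = np.random.default_rng(0)
lake_env = DummyVecEnv([lambda: gym.make("LakeEnv-v1")])  # custom lake environment

bc_trainer = bc.BC(
    observation_space=lake_env.observation_space,
    action_space=lake_env.action_space,
    rng=rng,
)

expert_policy = ...  # my optimization-derived expert, wrapped as a policy (defined elsewhere)

dagger_trainer = SimpleDAggerTrainer(
    venv=lake_env,
    scratch_dir="dagger_scratch",  # placeholder path
    expert_policy=expert_policy,
    bc_trainer=bc_trainer,
    rng=rng,
)

dagger_trainer.train(total_timesteps=2_000)  # the error below is raised while this runs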

I’m encountering the following error when attempting to extend and update the DAgger trainer with new demonstrations:

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 51 has 1 dimension(s)

The error occurs in the rollout.py file within the imitation library, specifically in the flatten_trajectories function when trying to concatenate reward arrays:

File "/imitation/data/types.py", line 224, in concatenate_maybe_dictobs return np.concatenate(arrs)

Steps Taken to Debug:

  1. I’ve already tried reshaping the reward array within my environment to ensure that it is always a 1D array.

    • In my step() function, I flatten the rewards as follows:
    rewards = np.array(rewards, dtype=np.float32).reshape(batch_size, )
    

    I confirmed that within my environment, the rewards are consistently shaped as (1,).

  2. I’ve added debugging statements in rollout.py to track the reward shapes. Interestingly, all rewards seem to have the shape (1, 1) before being replaced, even though I expect them to be (1,).

  3. I was initially using a RolloutInfoWrapper in my environment’s vectorized wrapper:

    lake_env = DummyVecEnv([lambda: RolloutInfoWrapper(FlattenObservation(gym.make("LakeEnv-v1"))) for _ in range(n_envs)])
    

    After removing RolloutInfoWrapper, I still encountered the same issue.

  4. Here’s my environment setup:

    • The action space is continuous: spaces.Box(low=np.array([0.01]), high=np.array([0.1]), dtype=np.float32)
    • The observation space is a single scalar: spaces.Box(low=0.0, high=2.0, shape=(1,), dtype=np.float32)
    • The reward is calculated based on the imitation accuracy of the learner relative to the expert.

Despite these adjustments, I continue to encounter the dimension mismatch in the flatten_trajectories function during the rollout phase.
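
For reference, the mismatch should also be visible directly on the collected trajectories, before any flattening. Below is a rough sketch of such a check, assuming the rollout helpers from imitation.data.rollout (generate_trajectories, make_sample_until) and the rews attribute on each trajectory:

import numpy as np
from imitation.data import rollout

rng = np.random.default_rng(0)

# Roll out a few episodes with the current learner policy and print the shape
# of each trajectory's reward array. Every traj.rews should be 1D with shape
# (T,); any (T, 1) entry is enough to break the later np.concatenate call.
trajectories = rollout.generate_trajectories(
    policy=bc_trainer.policy,  # assumed: the BC policy held by the DAgger trainer
    venv=lake_env,
    sample_until=rollout.make_sample_until(min_episodes=5),
    rng=rng,
)
for i, traj in enumerate(trajectories):
    print(i, np.asarray(traj.rews).shape)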

Relevant Code:

Here’s a snippet of my environment’s step() function for context:

def step(self, learner_action):
    if len(learner_action.shape) == 2:  # if batched, shape should be (batch_size, 1)
        batch_size = learner_action.shape[0]
    else:
        batch_size = 1

    lake_states = np.repeat(self.lake_state, batch_size)
    learner_action = np.clip(learner_action, 0.01, 0.1)
    next_lake_states = []
    rewards = []
    terminated = []

    for i in range(batch_size):
        lake_state = lake_states[i]
        action = learner_action[i]

        # Lake dynamics and reward calculation
        P_recycling = ((lake_state ** self.q) / (1 + lake_state ** self.q))
        natural_inflow = np.random.lognormal(mean=self.mu, sigma=self.sigma)  # lognormal inflow parameters stored on the env
        next_lake_state = lake_state * (1 - self.b) + P_recycling + action + natural_inflow
        next_lake_state = np.clip(next_lake_state, self.observation_space.low[0], self.observation_space.high[0])
        
        expert_action = self.expert_policy.predict([lake_state])[0]
        reward = float(-abs(action - expert_action))  # Reward based on imitation accuracy

        next_lake_states.append(next_lake_state)
        rewards.append(reward)
        terminated.append(self.year >= self.n_years)

    obs = np.array(next_lake_states, dtype=np.float32).reshape(batch_size, 1)
    rewards = np.array(rewards, dtype=np.float32).reshape(batch_size,)  # Reshaping to 1D

    return obs, rewards, terminated, False, {}

Questions:

  • What could be causing the reward arrays to still have inconsistent shapes by the time flatten_trajectories runs?
  • How can I ensure that all rewards are consistently 1D arrays throughout the rollout process and avoid this dimension mismatch?
  • Is there something I’m missing in my environment’s step() function or in the way I’m handling the rewards?


