Learning From the Crowd: A New Approach to Social Learning

Author: Denis Avetisyan


This research introduces a method for agents to effectively learn from observing others, even when those observers have varying levels of expertise.

The framework investigates a social bandit learning problem, positing that effective collaboration necessitates balancing individual reward acquisition with the collective benefit of shared knowledge among agents within a dynamic environment.

A free energy minimization framework combined with Thompson sampling enables robust social bandit learning in multi-agent systems.

While reinforcement learning typically focuses on individual experiences, humans and animals often benefit from observing others, revealing a critical gap in artificial intelligence. This paper, ‘Exploiting Expertise of Non-Expert and Diverse Agents in Social Bandit Learning: A Free Energy Approach’, addresses this limitation by introducing a novel social bandit learning algorithm grounded in free energy minimization and Thompson sampling. The method enables an agent to effectively learn from observing a population of diverse agents, even those with limited or suboptimal expertise, by strategically evaluating and integrating their behavioral information. Could this approach unlock more robust and adaptable learning systems capable of thriving in complex, dynamic environments?


Beyond the Isolated Agent: Embracing Collective Intelligence

Conventional reinforcement learning paradigms often depict agents as solitary entities, optimizing behavior through individual trial and error. However, this isolated approach represents a considerable simplification of the vast majority of natural systems, where organisms routinely learn from and interact with one another. From flocking birds and schooling fish to social insects and primate groups, collective behavior demonstrates that information sharing and observational learning are frequently more efficient than independent discovery. This is particularly true in dynamic and unpredictable environments, where the experiences of others can dramatically reduce the costs – and risks – associated with exploring novel strategies. Consequently, a growing body of research is challenging the assumption of isolated learners, recognizing that social interaction is often a fundamental component of adaptive behavior.

Traditional reinforcement learning often envisions agents acting in isolation, a framework that falters when confronted with the realities of complex and dynamic environments. These settings are rarely characterized by readily available, centralized information; instead, knowledge is frequently distributed across multiple sources and changes over time. An agent operating solely on its own experiences faces a significant disadvantage, requiring exponentially more exploration to achieve comparable performance to one capable of accessing external insights. This limitation is particularly acute in scenarios where the reward landscape shifts unpredictably, as individual learning struggles to keep pace with evolving optimal strategies. Consequently, the efficacy of purely individual-centric approaches diminishes rapidly as environmental complexity and information distribution increase, underscoring the need for alternative learning paradigms.

A central difficulty in reinforcement learning arises from the inherent tension between exploration and exploitation. An agent must continually decide whether to utilize currently known strategies that yield predictable rewards – exploitation – or to investigate novel approaches that might offer even greater benefits, but carry the risk of failure – exploration. This isn’t merely a matter of chance; it’s a fundamental optimization problem. Excessive exploitation can lead to stagnation in a changing environment, while unrestrained exploration may waste valuable resources and delay the acquisition of reliable rewards. The efficiency with which an agent navigates this tradeoff profoundly impacts its learning speed and ultimate success, necessitating sophisticated algorithms that dynamically adjust the balance based on environmental uncertainty and the potential for improved outcomes – a balance sometimes expressed schematically as <span class="katex-eq" data-katex-display="false">\text{Reward} = \alpha \cdot \text{Exploitation} + (1-\alpha) \cdot \text{Exploration}</span>.
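The schematic weighting above can be made concrete with one of the simplest mechanisms for the tradeoff: an <span class="katex-eq" data-katex-display="false">\epsilon</span>-greedy rule, sketched below. This is a generic illustration, not the paper’s method; the function name and parameters are our own.

```python
import random

def epsilon_greedy(values, epsilon=0.1, rng=random):
    """Pick the best-known arm with probability 1 - epsilon; otherwise
    pick a uniformly random arm (exploration)."""
    if rng.random() < epsilon:                      # explore
        return rng.randrange(len(values))
    return max(range(len(values)), key=values.__getitem__)  # exploit

# With epsilon = 0 the agent always exploits its current best estimate.
estimates = [0.2, 0.8, 0.5]
assert epsilon_greedy(estimates, epsilon=0.0) == 1
```

Raising `epsilon` shifts the balance toward exploration; more sophisticated algorithms adjust this balance adaptively rather than fixing it in advance.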

Effective adaptation in dynamic and complex environments often necessitates moving beyond individual trial-and-error learning. Organisms rarely operate in complete isolation; instead, they frequently benefit from observing and imitating the successes – and avoiding the failures – of others. This process, known as social learning, allows for the rapid acquisition of beneficial behaviors without the risks associated with independent exploration. By leveraging the collective knowledge of a population, individuals can bypass lengthy learning curves and quickly adopt strategies proven effective by their peers. This is particularly crucial in situations where environmental conditions change rapidly, or where individual exploration is costly or dangerous, demonstrating that the capacity to learn from others is often as important as the capacity to learn by doing.

Social learning agents (OUCB, TUCB, SBL-FE) consistently outperformed UCB and TS baselines across varying optimality gaps and horizons (200 and 10000) in a 2-armed Bernoulli bandit problem with one social and one non-social learner.

A Collective Approach: Social Bandit Learning in Action

Social Bandit Learning (SBL) represents a departure from standard bandit algorithms by explicitly integrating information derived from the actions of other agents within the same environment. Traditional bandit methods, such as Upper Confidence Bound (UCB) and Thompson Sampling (TS), operate under the assumption that an agent learns solely through its own experiences and rewards. SBL, however, allows an agent to observe the actions taken by its peers – specifically, which arms they selected – and incorporate this observational data into its decision-making process. This is typically achieved by weighting observed actions based on the observer’s assessment of the observed agent’s competence or reward signal, effectively allowing the learning agent to benefit from the trials and successes of others without directly experiencing those outcomes itself.
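The paper’s exact update rule is not reproduced in this summary, but the core idea – accumulating evidence for each arm from observed peer choices, weighted by an assessed competence for each observed agent – can be sketched as follows. The function, its name, and the competence dictionary are hypothetical illustrations.

```python
from collections import defaultdict

def social_arm_scores(observations, competence):
    """Accumulate a score per arm from peers' observed choices.

    observations: list of (agent_id, arm) pairs the learner has seen.
    competence:   dict agent_id -> weight in [0, 1], a (hypothetical)
                  estimate of how reliable that agent's choices are.
    """
    scores = defaultdict(float)
    for agent_id, arm in observations:
        scores[arm] += competence.get(agent_id, 0.0)
    return dict(scores)

obs = [("a", 0), ("a", 0), ("b", 1)]
weights = {"a": 0.9, "b": 0.3}
# Arm 0 is chosen twice by a competent agent, so it accumulates the most weight.
scores = social_arm_scores(obs, weights)
```

A learner can then fold such scores into its own value estimates, benefiting from peers’ trials without experiencing their outcomes directly.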

Social Bandit Learning (SBL) is particularly well-suited for modeling environments populated by agents exhibiting varied levels of expertise and potentially conflicting objectives. Unlike traditional bandit algorithms that assume a single, optimizing agent, SBL acknowledges the prevalence of heterogeneous agents in real-world scenarios. This is achieved by allowing agents to learn not only from their own experiences, but also by observing and incorporating information from the actions of other agents, even if those agents are pursuing different goals or possess differing prior knowledge. Consequently, SBL can effectively function in multi-agent systems where complete information sharing or centralized control is impractical or impossible, reflecting a more realistic depiction of complex adaptive systems.

Social Bandit Learning (SBL) demonstrably improves learning efficiency and overall performance in complex environments by incorporating observational data from other agents. Empirical results indicate that SBL algorithms achieve a substantial reduction in cumulative regret – the difference between the reward obtained by the agent and the reward of the optimal action – when compared to standard individual bandit learning methods such as Upper Confidence Bound (UCB) and Thompson Sampling (TS). This improvement stems from the ability of agents to quickly identify and exploit effective strategies exhibited by others, circumventing the need for extensive independent exploration and minimizing suboptimal action selection. The magnitude of regret reduction varies depending on the complexity of the environment and the number of observing agents, but consistently shows a performance advantage for SBL in heterogeneous agent systems.

Social transmission within Social Bandit Learning (SBL) enables agents to improve their decision-making processes by incorporating information derived from the actions of other agents in the environment. This is achieved by observing which actions others take and the resulting rewards, effectively creating a form of distributed knowledge sharing. Rather than each agent learning solely through independent exploration, SBL allows successful strategies to propagate through the group, leading to faster convergence on optimal or near-optimal policies. The collective intelligence of the group, therefore, becomes a valuable resource, supplementing individual learning and increasing overall efficiency, particularly in scenarios with sparse rewards or high dimensionality.

In a 10-armed Bernoulli bandit problem with an optimality gap of <span class="katex-eq" data-katex-display="false">\Delta=0.2</span>, social learning agents (OUCB, TUCB, SBL-FE) demonstrated cumulative regret performance comparable to baseline methods (UCB, TS), with TS showing similar performance to our approach in certain scenarios over both 200 and 2000 trials.

Mechanisms for Efficient Decision-Making: A Closer Look

Thompson Sampling (TS) is a Bayesian approach to solving multi-armed bandit problems, characterized by its ability to dynamically balance exploration and exploitation. Unlike methods such as <span class="katex-eq" data-katex-display="false">\epsilon</span>-greedy, TS maintains a probability distribution representing the belief about the value of each action. At each decision step, a sample is drawn from each action’s distribution, and the action with the highest sampled value is selected. This process inherently favors actions with high estimated values (exploitation) while simultaneously allowing for continued sampling from actions with high uncertainty (exploration), as the variance of the distribution influences the probability of selecting an action. The resulting algorithm is provably efficient, achieving logarithmic regret bounds, and is particularly effective in non-stationary environments due to its adaptive nature.

Thompson Sampling (TS) operates by representing the expected reward of each possible action with a probability distribution, typically a Beta distribution when rewards are binary or a Gaussian distribution for continuous rewards. This probabilistic representation allows the agent to quantify uncertainty about the true value of each action. During decision-making, TS samples a value from each action’s distribution and selects the action with the highest sampled value. This process inherently balances exploration – by occasionally sampling low values from potentially high-reward actions – and exploitation – by consistently selecting actions with high estimated values. Critically, even with a small number of observations, the maintained distributions provide a sufficient basis for informed action selection, as the sampling procedure reflects both the estimated mean reward and the associated uncertainty, improving performance compared to methods relying solely on point estimates.
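For Bernoulli rewards, this Beta-posterior sampling loop is short enough to sketch in full. The simulation below is a minimal illustration with assumed arm probabilities, not the paper’s experimental setup.

```python
import random

def thompson_step(successes, failures, rng=random):
    """One Thompson-sampling decision for Bernoulli arms.

    successes[i], failures[i] count observed rewards for arm i; the
    Beta(1 + s, 1 + f) posterior encodes both the mean estimate and its
    uncertainty, so sampling from it balances exploration/exploitation.
    """
    samples = [rng.betavariate(1 + s, 1 + f)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

# Simulate a 2-armed Bernoulli bandit: arm 1 pays off more often.
rng = random.Random(0)
probs = [0.3, 0.7]
s, f = [0, 0], [0, 0]
for _ in range(2000):
    arm = thompson_step(s, f, rng)
    if rng.random() < probs[arm]:
        s[arm] += 1
    else:
        f[arm] += 1
# The better arm should dominate the pull counts after enough trials.
assert s[1] + f[1] > s[0] + f[0]
```

Early on, wide posteriors make either arm plausible; as evidence accumulates, samples from the inferior arm’s posterior rarely exceed those of the better arm, and exploration naturally tapers off.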

The integration of Free Energy Minimization (FEM) into the Bayesian framework of Social Bandit Learning (SBL) offers a neurobiologically plausible model of bounded-rational decision-making. FEM, rooted in predictive processing, posits that agents actively minimize surprisal – the difference between predicted and actual sensory input – by updating their internal beliefs about the environment. Within SBL, this minimization is achieved through Bayesian inference, where agents maintain a probability distribution over possible states and actions. Computational constraints inherent in biological systems limit the complexity of these calculations; FEM provides a principled mechanism for approximating optimal Bayesian inference under these limitations, prioritizing the reduction of uncertainty and efficient action selection. This approach explains how agents can arrive at near-optimal policies even with incomplete information and limited computational resources, aligning with observed behavioral patterns and neural mechanisms.

Free Energy Minimization, as applied to Social Bandit Learning (SBL), posits that agents operating under computational limitations strive to minimize uncertainty in predicting future outcomes. This minimization process directly influences action selection, favoring options that reduce prediction error given available resources. The resulting behavior isn’t necessarily optimal in a strict sense, but rather a bounded-rational approximation thereof. Specifically, Theorem 1 demonstrates that under certain conditions, this minimization strategy leads to convergence towards the optimal policy, albeit potentially with a non-zero asymptotic error reflecting the constraints on computational capacity. This convergence is achieved by balancing the need for exploration to reduce uncertainty with the exploitation of current knowledge, effectively optimizing action selection within the defined computational budget.
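The paper’s exact objective is not reproduced in this summary, but a standard variational form of bounded-rational free-energy decision-making has a convenient closed-form policy, sketched below under that assumption.

```python
import math

def free_energy_policy(utilities, prior, temperature=1.0):
    """Bounded-rational policy from a variational free-energy objective.

    Minimizing F(p) = -sum_a p(a) * U(a) + T * KL(p || prior) over
    distributions p yields the closed form p(a) ∝ prior(a) * exp(U(a) / T).
    Low temperature -> near-greedy exploitation; high -> stay near the prior.
    """
    weights = [q * math.exp(u / temperature) for u, q in zip(utilities, prior)]
    z = sum(weights)
    return [w / z for w in weights]

# Uniform prior, one clearly better action: low temperature concentrates mass.
p = free_energy_policy([1.0, 0.0], prior=[0.5, 0.5], temperature=0.1)
assert p[0] > 0.99
```

The temperature plays the role of the computational budget: a tightly constrained agent stays close to its prior, while an unconstrained one converges on the utility-maximizing action.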

In a 10-armed Bernoulli bandit task with an optimality gap of <span class="katex-eq" data-katex-display="false">\Delta=0.2</span>, social learning agents (OUCB, TUCB, SBL-FE) outperformed baseline methods (UCB, TS) over 200-2000 trials, demonstrating effective learning within a society of one social learner and three epsilon-greedy agents with disjoint action sets.

Adapting to Change: Robustness in Dynamic Worlds

A significant strength of Social Bandit Learning (SBL) lies in its performance within dynamic, non-stationary environments – those where the very rules of reward are subject to change. Unlike traditional reinforcement learning methods that often struggle when faced with shifting conditions, SBL’s continuous learning process allows agents to adapt and maintain effective strategies. This adaptability isn’t simply about reacting to change, but proactively incorporating new information gleaned from both individual experience and observation of others. Consequently, an agent employing SBL can navigate fluctuating reward landscapes – where previously optimal actions become less effective, and new opportunities emerge – with greater resilience and sustained performance, demonstrating a crucial advantage in real-world applications where predictability is limited.

Successful adaptation in dynamic environments hinges on an agent’s capacity to integrate multiple sources of information. Rather than relying solely on individual trial-and-error, agents employing this strategy continuously refine their understanding of the world by observing the successes and failures of others. This social learning component, combined with ongoing personal experience, allows for a more rapid and robust response to shifting conditions. The agent effectively builds a cumulative knowledge base, enabling it to anticipate and navigate changes in reward distributions and maintain peak performance even as the environment evolves – a crucial characteristic for real-world applications where predictability is often limited and flexibility is paramount.

Evaluating the success of Social Bandit Learning (SBL) hinges on precise performance metrics, notably ‘regret’. This value doesn’t simply measure absolute achievement, but rather quantifies the opportunity cost of learning – the difference between the reward an agent actually receives and the reward it could have received had it consistently followed the optimal policy. A low regret score indicates efficient learning and swift adaptation, demonstrating the agent minimized losses during the learning process. Calculating regret involves comparing the cumulative reward achieved by the SBL agent to that of a hypothetical ‘oracle’ agent possessing perfect knowledge. This comparative analysis provides a robust measure of learning efficiency and allows for meaningful benchmarking against other learning algorithms, especially in dynamic environments where the optimal policy itself may evolve over time: <span class="katex-eq" data-katex-display="false">\text{Regret} = \sum_{t=1}^{T} (R_{\text{optimal}} - R_{\text{agent}})</span>.
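The regret sum is straightforward to compute once per-trial rewards are logged; a minimal sketch, assuming a known optimal expected reward per trial:

```python
def cumulative_regret(optimal_mean, rewards_received):
    """Sum over trials of (expected optimal reward - reward actually received)."""
    return sum(optimal_mean - r for r in rewards_received)

# Optimal arm pays 0.7 on average; the agent earned these rewards over 5 trials:
# regret = 5 * 0.7 - 3 = 0.5
regret = cumulative_regret(0.7, [0, 1, 0, 1, 1])
```

An efficient learner’s cumulative regret grows sublinearly (logarithmically for TS-style algorithms), since suboptimal pulls become increasingly rare over time.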

The true strength of this approach lies in its practical implications for dynamic, real-world scenarios; unlike many learning algorithms that falter when conditions shift, this method exhibits remarkable adaptability. Environments are rarely static – reward structures evolve, unforeseen obstacles emerge, and even the very data available to an agent can be noisy or incomplete. This learning strategy doesn’t merely react to these changes, but actively incorporates them into its ongoing decision-making process. Through continuous learning and refinement based on both individual experiences and observations of others, the system maintains performance even amidst substantial environmental variation and the presence of observation noise, proving its robustness and paving the way for deployment in complex, unpredictable settings.

Across 2000 trials of a 10-armed Bernoulli bandit problem with <span class="katex-eq" data-katex-display="false">\Delta=0.2</span>, our social learning agent (SBL-FE) demonstrates competitive cumulative regret performance alongside established algorithms, exhibiting per-trial selection probabilities that reflect societal influences.

The pursuit of robust learning systems necessitates a focus on fundamental principles. This work, centered on social bandit learning and free energy minimization, echoes that sentiment. It demonstrates how an agent can benefit from observing others, even without complete knowledge of their capabilities – a testament to the power of leveraging collective intelligence. As Paul Erdős once stated, “A mathematician knows a lot of things, but a physicist knows the deep underlying principles.” Similarly, this research isn’t merely about achieving performance; it’s about uncovering the deep principles that govern effective learning in multi-agent systems. The elegance of the approach lies in its ability to distill complex interactions into a framework driven by exploration-exploitation and free energy, highlighting how structure dictates behavior within the learning process.

What Lies Ahead?

The pursuit of learning through observation, as demonstrated by this work, inevitably highlights the inherent complexities of agency and environment. Minimizing free energy offers a compelling architectural principle, yet the elegance of the formulation does not dissolve the difficulty of specifying the generative model itself. A truly robust system must account for the non-stationarity of both the environment and the observed agents – their expertise, biases, and even motivations are not fixed points. Modifying one component of this social learning architecture – say, the Thompson sampling mechanism – triggers a cascade of consequences throughout the entire system, demanding a holistic understanding.

Future investigations should address the limits of scalability. While the approach demonstrates efficacy with a relatively small number of agents, the computational burden of maintaining probabilistic representations for a large, dynamic population is substantial. A deeper exploration of approximate inference techniques, and potentially hierarchical modeling, seems crucial. Furthermore, the assumption of a shared environment, while simplifying the initial problem, is rarely met in practice. Developing mechanisms for agents to discern and adapt to differing environmental contexts will be paramount.

Ultimately, this line of inquiry points toward a broader challenge: constructing artificial systems capable of not merely adapting to complexity, but of internalizing a coherent model of the world – a model that encompasses not only physical laws, but also the intentions and beliefs of other agents. The path forward requires a commitment to understanding the intricate interplay between individual agency, social dynamics, and the underlying structure of the environment.


Original article: https://arxiv.org/pdf/2603.11757.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
