Taming the Baffling Unpredictability of AI

Author Name

Roshan Bharwaney and Angelique Faustino

Published On

April 17, 2025

Keywords/Tags

Performance Tuning, Human-AI Collaboration, Error Management, Strategic Guidance, Bias Mitigation

What first drew Rissa to her partner, Joel, was his uncanny ability to make her laugh just when she least expected it. Of course, his unpredictability wasn’t always so charming. When their discussions veered into arguments, Joel’s logic seemed maddeningly elusive. Rissa often wished she could turn a dial – a sort of “temperature” knob – to momentarily steer Joel back toward reason.

Surprisingly enough, such a knob exists – not for human relationships, but within the fascinating world of generative AI prompt engineering. OpenAI exposes it through parameters like “temperature,” “presence penalty,” and “frequency penalty.” Other leading generative models offer similar knobs, such as “top-p” (or nucleus sampling), “entropy penalty,” and “diversity penalty.” These API parameters serve as formal directives to the model to explore novel responses or to stick with safer, more predictable territory. As with humans, part of AI’s charm lies in its unpredictability: its flashes of insight, moments of humor, and even its occasional quirks. The variability that occasionally frustrates can also be the very thing that keeps AI, and people, genuinely engaging.
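To make the knob concrete, here is a minimal sketch of setting those parameters, assuming the OpenAI Python SDK; the model name and prompt are illustrative placeholders, and the specific values are just one reasonable starting point.

```python
# A minimal sketch of "turning the dial" via API parameters (OpenAI Python SDK).
# The model name and prompt are illustrative; adjust values to taste.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user",
               "content": "Suggest a title for an essay on AI unpredictability."}],
    temperature=0.2,       # lower = safer, more predictable wording
    top_p=0.9,             # nucleus sampling: draw only from the top 90% of probability mass
    presence_penalty=0.6,  # discourage revisiting topics already mentioned
    frequency_penalty=0.4, # discourage repeating the same tokens
)
print(response.choices[0].message.content)
```

Lower temperature and top-p keep the model on script; raising them, or easing off the penalties, invites more of the surprising behavior that makes the output feel alive.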

Charm Defensive

Despite impressive advancements, AI can still confound us with responses that miss the mark on accuracy, creativity, or even what we view as common sense. Such missteps can prompt criticism and raise concerns about “hallucinations” or AI’s lack of intuition and authentic emotion. Yet humans, too, regularly grapple with bias, miscommunication, and inconsistent judgment, and don’t always respond identically to the same question over time. In many professional domains, human errors can surpass those of AI (Almog et al., 2024; Maslej et al., 2024; Rosenbacke et al., 2024; Logol, 2023). Managers and clients frequently need to clarify their expectations or needs and provide feedback or correction to colleagues to enhance mutual understanding and performance. Similarly, AI outputs improve dramatically when we give the model feedback and clarify or refine our initial prompts.

While human errors inspire romantic comedies celebrating our ability to adapt spontaneously, perhaps we are less forgiving of AI errors because we fear they could escalate into something large-scale and systemic. AI errors emerge from biases embedded in training data, algorithmic limitations, constraints of the model architecture, or overly rigid pattern recognition. Biased datasets might lead LLMs to reinforce existing prejudices, while algorithmic constraints can lead to overly simplistic interpretations of complex tasks. Human mistakes, on the other hand, can arise from contextual misunderstandings, knowledge gaps, cognitive biases, rushed decisions, poor listening, memory lapses, emotional influences, or simply overlooking critical details out of fatigue or distraction.

AI Blunder Management

The differences between human and AI error and learning have profound implications for AI safety and clarify the opportunity for improvement. AI learning is algorithmically stochastic, while humans learn through a variety of methods, including experience, reflection, feedback, experimentation, and social interaction. The structured nature of AI means its errors are more predictable and correctable, allowing us to build mechanisms that improve performance and reliability – while still preserving that je ne sais quoi that first charmed us in those early ChatGPT moments.

Beyond carefully crafted prompts, here are five methods that can steer AI towards more reliable results:

  • Define Clear Goals and Boundaries, just as you would to set up a new hire for success, but recognize that AI can often perform beyond entry level. Set a target for performance that’s “good enough” for a specific task in a limited domain. Define “good enough” by designing and implementing evaluation protocols that test, nudge, or even challenge the model at critical checkpoints (a minimal sketch of such a checkpoint, paired with an A/B comparison, follows this list).
  • Raise the Bar (inspired by Amazon’s “50% Rule”): Amazon’s hiring process famously seeks to raise the bar by specifying that every new hire should outperform 50% of their peers in similar roles (Amazon, 2019). Raise the bar on AI by evaluating exactly how a model or agent can be better than its peers (human or AI). A/B testing and monitoring of different models or model configurations can help identify where each is stronger or weaker and refine the success criteria.
  • Train with Quality over Quantity: New hires and new models do better when trained on intentionally curated, trustworthy, domain-specific data. For factual, up-to-date information you want AI to reference rather than memorize, leverage retrieval-augmented generation (RAG); a minimal RAG sketch follows this list. For specialized knowledge you want AI to internalize, use labeled, task-specific data through supervised fine-tuning (SFT). Define tenets – ethical principles or guidelines – to help models operate within accuracy standards and behavioral norms.
  • Calibrate by Consensus: Similar to designing a candidate interview loop, create an ensemble of models with diverse perspectives to propose, evaluate, and critique outcomes and refine results (a minimal voting sketch follows this list). Whether voting on the best option or evaluating the work of their predecessors, a layered, iterative approach can mimic collaborative human decision-making through collective intelligence. Combine LLMs with more deterministic rule-based systems or traditional machine learning models for balance. It helps to hire the right model for the job and give it the best tools (e.g., internal database access). Out-of-the-box thinking could be just the thing for some tasks, but inappropriate for others.
  • Human-AI Mentoring: Take a page from pair programming and put a mentor in the loop. Human-AI combinations can outperform either contributor working alone. AI can also break down functional silos, encouraging a balanced approach when tasks are cross-functional. Furthermore, reinforcement learning (RL) techniques use reward-based feedback to help models adapt to dynamic environments (Vaccaro et al., 2024; Dell’Acqua et al., 2025; Logol, 2023).
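
Here is the checkpoint and A/B comparison sketch mentioned in the first two items. The helper names, test cases, and the stubbed model call are illustrative assumptions, not a specific evaluation framework.

```python
# Minimal sketch: gate a model behind a "good enough" bar and compare two
# candidates on the same checks. Names, thresholds, and test cases are illustrative.
from typing import Callable

TEST_CASES = [
    ("What is 2 + 2?", "4"),
    ("What is the chemical symbol for gold?", "Au"),
]

def accuracy(generate: Callable[[str], str]) -> float:
    """Fraction of checkpoint prompts whose output contains the expected answer."""
    hits = sum(expected.lower() in generate(prompt).lower()
               for prompt, expected in TEST_CASES)
    return hits / len(TEST_CASES)

def good_enough(generate: Callable[[str], str], threshold: float = 0.9) -> bool:
    """Pass the checkpoint only if accuracy clears an explicit, task-specific bar."""
    return accuracy(generate) >= threshold

def ab_test(generate_a: Callable[[str], str], generate_b: Callable[[str], str]) -> str:
    """Score two models or configurations on the same evaluation set."""
    a, b = accuracy(generate_a), accuracy(generate_b)
    return f"A: {a:.0%}  B: {b:.0%}  better: {'A' if a >= b else 'B'}"

# Stand-in "model" for demonstration; in practice this would call a real model.
stub = lambda prompt: "4" if "2 + 2" in prompt else "Au"
print(good_enough(stub))  # True: the stub clears the 90% bar on both checks
```

The point is not these particular checks but that “good enough” becomes an explicit number you can test at every checkpoint.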
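Next, the RAG sketch promised in the third item. The toy document store, word-overlap retriever, and prompt wording are stand-ins; real systems typically retrieve with embedding-based vector search.

```python
# Minimal RAG sketch: retrieve relevant facts first, then ground the prompt in them.
# The documents, retriever, and prompt template are illustrative placeholders.
DOCUMENTS = [
    "Refund requests are accepted within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm.",
    "Premium plans include priority support and a dedicated account manager.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the question and keep the top k."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(question: str) -> str:
    """Instruct the model to answer from retrieved context rather than from memory."""
    context = "\n".join(retrieve(question, DOCUMENTS))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

print(build_prompt("How long do I have to request a refund?"))
```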
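Finally, a minimal sketch of calibrating by consensus: several stand-in “models” each propose an answer and a simple majority vote picks the result. The proposer functions are placeholders for calls to different models or configurations.

```python
# Minimal consensus sketch: collect one answer per model, then majority-vote.
# The proposer functions are placeholders for real model or rule-based calls.
from collections import Counter
from typing import Callable, Iterable

def propose_a(question: str) -> str: return "Paris"  # stand-in for model A
def propose_b(question: str) -> str: return "Paris"  # stand-in for model B
def propose_c(question: str) -> str: return "Lyon"   # stand-in for model C

def consensus(question: str, proposers: Iterable[Callable[[str], str]]) -> str:
    """Return the answer most of the proposers agree on."""
    votes = Counter(proposer(question) for proposer in proposers)
    answer, _count = votes.most_common(1)[0]
    return answer

print(consensus("What is the capital of France?",
                [propose_a, propose_b, propose_c]))  # Paris
```

The same pattern extends naturally to having one model critique or rank another’s drafts before a final answer is chosen.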

Strategic Vibing with AI

Generative AI is less a tool and more a dynamic collaborator whose abilities grow with clear guidance, structured feedback, and thoughtful interventions. Effective AI management is remarkably similar to nurturing successful human collaboration. Our relationships with colleagues and AI both benefit from clarity, ongoing communication, and the wisdom to leverage strengths while safeguarding against cumulative errors.

Just as human relationships thrive on patience, communication, and adaptability, getting the most out of AI requires vibing with its inherent indeterminism through strategic intention, experimentation, and informed care. After all, it’s precisely this delicate balance between control and creative freedom that allows AI, and human partnerships, to truly flourish.

References

  • Almog, D., Gauriot, R., Page, L., & Martin, D. (2024, July). AI oversight and human mistakes: Evidence from Centre Court. In Proceedings of the 25th ACM Conference on Economics and Computation, 103–105. https://doi.org/10.48550/arXiv.2401.16754
  • Amazon (2019). What is a ‘Bar Raiser’ at Amazon? Retrieved from:
    https://www.aboutamazon.eu/news/working-at-amazon/what-is-a-bar-raiser-at-amazon
  • Dell’Acqua, F., Ayoubi, C., Lifshitz-Assaf, H., Sadun, R., Mollick, E. R., Mollick, L., Han, Y., Goldman, J., Nair, H., Taub, S., & Lakhani, K. R. (2025). The Cybernetic Teammate: A Field Experiment on Generative AI Reshaping Teamwork and Expertise. Harvard Business School Working Paper No. 25-043. http://dx.doi.org/10.2139/ssrn.5188231
  • Logol (2023). The Range of Error Accepted in AI Utilization vs. Human Error in the Legal Sector. Retrieved from:
    https://www.luxatiainternational.com/article/the-range-of-error-accepted-in-ai-utilization-vs-human-error-in-the-legal-sector
  • Maslej, N., Fattorini, L., Perrault, R., Parli, V., Reuel, A., Brynjolfsson, E., Etchemendy, J., Ligett, K., Lyons, T., Manyika, J., Niebles, J.C., Shoham, Y., Wald, R., & Clark, J. (2024). The AI Index 2024 Annual Report, Institute for Human-Centered AI, Stanford University, Stanford, CA. Retrieved from: https://hai.stanford.edu/ai-index/2024-ai-index-report
  • Rosenbacke, R., Melhus, Å. & Stuckler, D. (2024). False conflict and false confirmation errors are crucial components of AI accuracy in medical decision making. Nature Communications, 15, 6896. https://doi.org/10.1038/s41467-024-50952-3
  • Vaccaro, M., Almaatouq, A., & Malone, T. (2024). When combinations of humans and AI are useful: A systematic review and meta-analysis. Nature Human Behaviour, 8, 2293–2303. https://doi.org/10.1038/s41562-024-02024-1