Latest Papers On LLM Agents, Medical LLMs, And Medical Reasoning - July 7, 2025

by StackCamp Team

Stay up-to-date with the most recent advancements in LLM agents, medical large language models, and medical reasoning. This article summarizes the 15 most recent research papers in each category, published up to July 7, 2025, providing a comprehensive overview of the current state of the art. For a better reading experience and access to more papers, check the GitHub page.

LLM Agents

LLM agents are rapidly transforming various fields by enabling machines to perform complex tasks autonomously. This section highlights the latest research in this dynamic area, exploring topics such as security, video recommendation, cyberattack classification, ad keyword generation, and more. The papers listed below delve into the capabilities and challenges of LLM agents, offering valuable insights for researchers and practitioners alike.

The field of LLM Agents has seen significant growth, with researchers exploring their potential in various domains. One crucial aspect is the security of these agents, as highlighted in Control at Stake: Evaluating the Security Landscape of LLM-Driven Email Agents. This paper likely investigates the vulnerabilities and potential risks associated with using LLMs in email communication, an area where security breaches can have severe consequences. Understanding these risks is paramount for developing robust and secure LLM agents.

Another exciting application is in video recommendation, as explored in VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning. This research leverages Multi-Modal Large Language Models (MLLMs) and reinforcement learning to create agents that can provide more personalized and effective video recommendations. This approach has the potential to significantly enhance user experience on video streaming platforms.

Cybersecurity is another critical area where LLM agents can make a significant impact. The paper CyberRAG: An agentic RAG cyber attack classification and reporting tool introduces a tool that uses agentic Retrieval-Augmented Generation (RAG) to classify and report cyberattacks. This demonstrates the capability of LLM agents to automate and improve cybersecurity measures, a vital aspect in today's digital landscape.
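The paper's actual pipeline isn't reproduced here, but the general retrieve-then-classify pattern behind agentic RAG tools can be sketched as follows. The knowledge base, scoring, and function names are illustrative assumptions, not CyberRAG's implementation:

```python
# Hypothetical sketch of a RAG-style attack-classification step: retrieve the
# most relevant attack descriptions from a knowledge base, then assemble an
# augmented prompt for an LLM classifier. Everything here is a toy stand-in.

from collections import Counter

KNOWLEDGE_BASE = {
    "phishing": "email credential lure spoofed login link",
    "ddos": "traffic flood botnet service outage requests",
    "sql_injection": "database query input payload union select",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank knowledge-base entries by simple term overlap with the query."""
    q_terms = Counter(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda kv: -sum(q_terms[t] for t in kv[1].split()),
    )
    return [label for label, _ in scored[:k]]

def build_prompt(alert: str) -> str:
    """Augment the raw alert with retrieved context before asking the LLM."""
    candidates = retrieve(alert)
    return (
        f"Alert: {alert}\n"
        f"Candidate classes (retrieved): {', '.join(candidates)}\n"
        "Classify the attack and justify your answer briefly."
    )

print(retrieve("suspicious email with spoofed login link"))  # top hit: phishing
```

A production system would replace the term-overlap retriever with dense embeddings and send the prompt to an actual LLM, but the retrieval-before-generation structure is the same.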

The application of LLM agents extends to marketing and advertising, as seen in OMS: On-the-fly, Multi-Objective, Self-Reflective Ad Keyword Generation via LLM Agent. This research focuses on using LLM agents to generate ad keywords in real-time, optimizing for multiple objectives and using self-reflection to improve performance. This can lead to more effective and targeted advertising campaigns.

Decision-Oriented Text Evaluation highlights the importance of evaluating text generation models based on their decision-making capabilities. This is a crucial aspect of ensuring that LLM agents not only generate coherent text but also make sound decisions based on the information they process. Furthermore, the future of recommender systems is explored in The Future is Agentic: Definitions, Perspectives, and Open Challenges of Multi-Agent Recommender Systems, which discusses the potential of multi-agent systems in enhancing recommendation accuracy and personalization.

Scientific research also benefits from LLM agents, as demonstrated in Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab. This paper assesses the ability of language models to perform scientific tasks in a systems biology context, showcasing their potential in accelerating research and discovery. The ethical considerations of LLM agents are addressed in Evaluating LLM Agent Collusion in Double Auctions, which examines the possibility of collusion in auction scenarios, emphasizing the need for careful design and monitoring of LLM agents.

In the realm of manufacturing, AI Agents and Agentic AI-Navigating a Plethora of Concepts for Future Manufacturing provides an overview of how AI agents can revolutionize the industry, enhancing efficiency and automation. The application of LLM agents in biomedical research is explored in STELLA: Self-Evolving LLM Agent for Biomedical Research, which introduces an agent designed to assist researchers in this domain.

Safety is a paramount concern, and Enhancing LLM Agent Safety via Causal Influence Prompting presents a method for improving the safety of LLM agents using causal influence prompting. This technique helps in mitigating unintended consequences and ensuring that agents act responsibly. The integration of LLM agents in urban environments is discussed in Large Language Model Powered Intelligent Urban Agents: Concepts, Capabilities, and Applications, highlighting their potential in various urban applications.

The social behavior of LLM agents is examined in Generative Exaggeration in LLM Social Agents: Consistency, Bias, and Toxicity, which investigates the tendency of these agents to exaggerate and exhibit biases, emphasizing the importance of addressing these issues. Automation of thematic analysis is explored in Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning, demonstrating how LLM agents can streamline qualitative research processes.

Finally, the role of LLM agents in breaking down digital barriers is discussed in LLM Agents Are the Antidote to Walled Gardens, suggesting that these agents can help users navigate and access information more freely across different platforms. This comprehensive overview of recent research highlights the diverse applications and challenges in the field of LLM agents, underscoring its growing importance in various domains.

Title Date Comment
Control at Stake: Evaluating the Security Landscape of LLM-Driven Email Agents 2025-07-03
VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning 2025-07-03
CyberRAG: An agentic RAG cyber attack classification and reporting tool 2025-07-03
OMS: On-the-fly, Multi-Objective, Self-Reflective Ad Keyword Generation via LLM Agent 2025-07-03
Decision-Oriented Text Evaluation 2025-07-03
The Future is Agentic: Definitions, Perspectives, and Open Challenges of Multi-Agent Recommender Systems 2025-07-02
Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab 2025-07-02
Evaluating LLM Agent Collusion in Double Auctions 2025-07-02
AI Agents and Agentic AI-Navigating a Plethora of Concepts for Future Manufacturing 2025-07-02 Submitted to JMS (March 2025)
STELLA: Self-Evolving LLM Agent for Biomedical Research 2025-07-01
Enhancing LLM Agent Safety via Causal Influence Prompting 2025-07-01 Accepted at ACL 2025 Findings, Source code: https://github.com/HahmDY/causal_influence_prompting.git
Large Language Model Powered Intelligent Urban Agents: Concepts, Capabilities, and Applications 2025-07-01
Generative Exaggeration in LLM Social Agents: Consistency, Bias, and Toxicity 2025-07-01
Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning 2025-06-30 Presented at ACL 2025 SRW
LLM Agents Are the Antidote to Walled Gardens 2025-06-30

Medical Large Language Models

Medical Large Language Models (Medical LLMs) are revolutionizing healthcare by providing advanced diagnostic and analytical capabilities. This section reviews the latest papers focusing on the development, evaluation, and application of these models. The research covers a range of topics, including medical ethics, multi-modal diagnosis, reasoning, and the mitigation of hallucinations, offering a comprehensive look at the cutting edge of Medical LLMs.

The use of Medical Large Language Models (Medical LLMs) in healthcare is rapidly expanding, with significant research focused on their capabilities and limitations. One crucial aspect is the ethical considerations in medical decision-making, which is addressed in MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs. This benchmark likely provides a framework for evaluating how well Medical LLMs handle ethical dilemmas, a critical factor in their adoption in clinical settings.

Multi-modal diagnosis, which combines different types of medical data, is another area of active research. MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration introduces a framework that uses multiple agents to collaborate on diagnoses, leveraging the strengths of each agent. This approach can lead to more accurate and comprehensive diagnostic outcomes.
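The role-specialized collaboration idea can be illustrated with a toy coordinator that aggregates votes from specialist "agents". The agent functions, case fields, and majority-vote aggregation below are hypothetical simplifications, not MAM's architecture:

```python
# Toy sketch of role-specialized multi-agent diagnosis: each "agent" is a
# stand-in specialist function; a coordinator aggregates their votes.
# Real frameworks use LLM-backed agents and richer aggregation.

from collections import Counter

def radiology_agent(case: dict) -> str:
    return "pneumonia" if "infiltrate" in case["image_findings"] else "normal"

def lab_agent(case: dict) -> str:
    return "pneumonia" if case["wbc"] > 11.0 else "normal"

def history_agent(case: dict) -> str:
    return "pneumonia" if "cough" in case["symptoms"] else "normal"

def coordinator(case: dict) -> str:
    """Majority vote over the specialists' independent assessments."""
    votes = [agent(case) for agent in (radiology_agent, lab_agent, history_agent)]
    return Counter(votes).most_common(1)[0][0]

case = {"image_findings": "lobar infiltrate", "wbc": 13.2, "symptoms": ["fever", "cough"]}
print(coordinator(case))  # pneumonia
```

The design point is that each agent only sees the modality it specializes in, and disagreements surface at the coordinator rather than being hidden inside a single monolithic model.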

The ability of Medical LLMs to reason and apply knowledge is essential for their effectiveness. Disentangling Reasoning and Knowledge in Medical Large Language Models delves into the separation of reasoning and knowledge within these models, aiming to understand how each contributes to their performance. This understanding is crucial for improving the models' capabilities.

The reliability of Medical LLMs is also a key focus, particularly in the context of potential biases and errors. The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making explores how non-content perturbations affect the decision-making of both humans and Medical LLMs, providing insights into their robustness. Similarly, DeVisE: Behavioral Testing of Medical Large Language Models presents a method for testing the behavior of these models, ensuring they perform as expected in various scenarios.

Medical LLMs are being explored for various applications, including cancer pre-screening. Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation Models investigates the use of these models in identifying cancer risk, potentially leading to earlier detection and treatment. The integration of Retrieval-Augmented Generation (RAG) techniques is explored in MRD-RAG: Enhancing Medical Diagnosis with Multi-Round Retrieval-Augmented Generation, which aims to improve diagnostic accuracy by incorporating relevant information retrieval.

Benchmarks are crucial for evaluating the performance of Medical LLMs. ECG-Expert-QA: A Benchmark for Evaluating Medical Large Language Models in Heart Disease Diagnosis provides a specific benchmark for heart disease diagnosis, allowing for a standardized assessment of model capabilities. The potential for distraction in Medical LLMs is examined in Medical large language models are easily distracted, highlighting the need for robust models that can focus on relevant information.

The development of specialized Medical LLMs is also a focus, as seen in CancerLLM: A Large Language Model in Cancer Domain, which introduces a model tailored for cancer-related tasks. Hallucination, the generation of incorrect or nonsensical information, is a significant concern, and MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models provides a benchmark for assessing this issue.

The importance of construct validity in benchmarks is discussed in Medical Large Language Model Benchmarks Should Prioritize Construct Validity, emphasizing the need for benchmarks that accurately measure the intended capabilities. The performance of Chinese Medical LLMs is analyzed in Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies, providing insights into their strengths and weaknesses.

Medical LLMs are also being used in personalized medicine, as demonstrated in Patient-Level Anatomy Meets Scanning-Level Physics: Personalized Federated Low-Dose CT Denoising Empowered by Large Language Model, which uses these models to improve CT imaging. Finally, the democratization of Medical LLMs across different languages is addressed in Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts, highlighting efforts to make these models accessible globally.

Title Date Comment
MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs 2025-06-28 20 pages
MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration 2025-06-24 ACL 2025 Findings
Disentangling Reasoning and Knowledge in Medical Large Language Models 2025-06-24
The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making 2025-06-20
DeVisE: Behavioral Testing of Medical Large Language Models 2025-06-18
Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation Models 2025-05-30
MRD-RAG: Enhancing Medical Diagnosis with Multi-Round Retrieval-Augmented Generation 2025-04-10
ECG-Expert-QA: A Benchmark for Evaluating Medical Large Language Models in Heart Disease Diagnosis 2025-04-07
Medical large language models are easily distracted 2025-04-01 20 pages, 2 main figures, 6 extended figures
CancerLLM: A Large Language Model in Cancer Domain 2025-04-01 New version; adds the RAG version of CancerLLM
MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models 2025-03-28 Published to AAAI-25 Bridge Program
Medical Large Language Model Benchmarks Should Prioritize Construct Validity 2025-03-12
Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies 2025-03-10
Patient-Level Anatomy Meets Scanning-Level Physics: Personalized Federated Low-Dose CT Denoising Empowered by Large Language Model 2025-03-02 Accepted by CVPR 2025
Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts 2025-02-10

Large Language Models

Large Language Models (LLMs) are at the forefront of artificial intelligence research, driving advancements in various applications. This section summarizes recent papers that explore the capabilities, challenges, and innovations in LLMs. Topics include grounded chain-of-thought, reinforcement fine-tuning, multimodal reasoning, and the mitigation of biases. These papers offer a comprehensive view of the current state and future directions of Large Language Models.

The advancements in Large Language Models (LLMs) continue to push the boundaries of artificial intelligence, with researchers exploring new techniques and applications. One area of focus is improving the reasoning capabilities of LLMs, as seen in Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation. This paper explores how grounded chain-of-thought can enhance multimodal LLMs, making them more data-efficient in model adaptation. This is crucial for applications requiring complex reasoning and understanding of multimodal inputs.

Another application of LLMs is in requirements elicitation, as highlighted in Requirements Elicitation Follow-Up Question Generation. This research focuses on generating follow-up questions to better understand requirements, a critical step in software development and project management. The use of LLMs in this area can streamline the elicitation process and improve the quality of requirements.

Fine-tuning LLMs for specific tasks is also a key area of research. MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs introduces a method for modular thinking via reinforcement fine-tuning, allowing LLMs to tackle complex tasks more effectively. This approach can enhance the performance of LLMs in various domains by tailoring them to specific needs.

The security of LLMs is a growing concern, and Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection explores visual contextual attacks that can jailbreak multimodal LLMs. This research highlights the vulnerabilities of these models and the need for robust defense mechanisms. Additionally, the use of LLMs in treatment effect estimation is examined in LLM-Driven Treatment Effect Estimation Under Inference Time Text Confounding, demonstrating their potential in causal inference tasks.

Watermarking LLMs is crucial for ensuring the authenticity of generated content, and Improved Unbiased Watermark for Large Language Models presents an improved method for unbiased watermarking. This technique helps in identifying the source of generated text, which is essential for combating misinformation. The use of LLMs in reinforcement learning is explored in StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason, demonstrating how stepwise hints can enhance the reasoning capabilities of reinforcement learning agents.
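To make the watermarking idea concrete, here is a minimal sketch of the classic "green-list" scheme, in which each position's allowed-token subset is pseudo-randomly seeded by the previous token and detection counts how many tokens landed in their green list. Note this is the baseline construction that unbiased watermarks improve upon, not the paper's method; the vocabulary and hashing choices are illustrative:

```python
# Simplified green-list watermark: seed each position's "green" vocabulary
# subset on the previous token; a detector measures the green-token fraction.
# A real LM only biases sampling toward green tokens; here we always pick one.

import hashlib

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]

def green_list(prev_token: str, fraction: float = 0.5) -> set[str]:
    """Deterministic pseudo-random 'green' subset of VOCAB, seeded on prev_token."""
    def key(tok: str) -> str:
        return hashlib.sha256((prev_token + "|" + tok).encode()).hexdigest()
    return set(sorted(VOCAB, key=key)[: int(len(VOCAB) * fraction)])

def generate_watermarked(start: str, n: int) -> list[str]:
    """Emit only green tokens (deterministically, for reproducibility)."""
    toks = [start]
    for _ in range(n):
        toks.append(min(green_list(toks[-1])))
    return toks

def green_fraction(tokens: list[str]) -> float:
    """Detection statistic: fraction of tokens inside their predecessor's green list."""
    hits = sum(tokens[i] in green_list(tokens[i - 1]) for i in range(1, len(tokens)))
    return hits / max(len(tokens) - 1, 1)
```

Unwatermarked text lands in the green list roughly half the time by chance, so a green fraction near 1.0 over a long passage is strong statistical evidence of machine generation. The "unbiased" line of work modifies the sampling step so the watermark leaves the output distribution unchanged.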

LLMs are also being used to improve web search and research processes. From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents discusses how reasoning agents can incentivize deeper research through web search, potentially revolutionizing information retrieval. The combination of self-explanation and reinforcement learning is explored in ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning, aiming to unlock hard reasoning capabilities in LLMs.

The application of LLMs in robotics is demonstrated in Large Language Model-Driven Closed-Loop UAV Operation with Semantic Observations, which uses LLMs to drive UAV operations with semantic observations. This showcases the potential of LLMs in autonomous systems and robotics. A framework for auto-route switching in dual-state LLMs is presented in SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model, enhancing the flexibility and adaptability of these models.

Multimodal reasoning, which involves processing and understanding different types of data, is another area of advancement. Multimodal Mathematical Reasoning with Diverse Solving Perspective explores multimodal mathematical reasoning with diverse solving perspectives, highlighting the potential of LLMs in complex problem-solving. The presence of bias in LLMs is examined in Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models, emphasizing the importance of addressing biases in these models.

LLMs are also being used in creative applications, such as video editing. From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding introduces a framework for video editing inspired by human narrative understanding. Finally, techniques for accelerating the convergence of LLM pretraining are explored in GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling, improving the efficiency of training these models.

Title Date Comment
Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation 2025-07-03 Accepted by ICCV2025
Requirements Elicitation Follow-Up Question Generation 2025-07-03 13 pages, 2 figures, accepted at the 33rd IEEE International Requirements Engineering 2025
MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs 2025-07-03
Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection 2025-07-03 16 pages
LLM-Driven Treatment Effect Estimation Under Inference Time Text Confounding 2025-07-03
Improved Unbiased Watermark for Large Language Models 2025-07-03 ACL 2025 Main Conference
StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason 2025-07-03
From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents 2025-07-03
ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning 2025-07-03
Large Language Model-Driven Closed-Loop UAV Operation with Semantic Observations 2025-07-03 9 pages, 7 figures
SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model 2025-07-03
Multimodal Mathematical Reasoning with Diverse Solving Perspective 2025-07-03 8 pages
Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models 2025-07-03
From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding 2025-07-03
GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling 2025-07-03

Medical Reasoning

Medical reasoning is a critical aspect of healthcare, and LLMs are increasingly being used to enhance this process. This section highlights the latest research in medical reasoning, focusing on approaches that leverage knowledge enhancement, chain-of-thought techniques, and multi-agent systems. These papers explore how LLMs can improve diagnostic accuracy and medical decision-making, offering valuable insights for the future of healthcare.

Medical reasoning is a complex process that requires the integration of knowledge, inference, and clinical guidelines. Large Language Models (LLMs) are being developed to aid in this process, and recent research has focused on enhancing their capabilities. One approach is knowledge enhancement, as seen in KERAP: A Knowledge-Enhanced Reasoning Approach for Accurate Zero-shot Diagnosis Prediction Using Multi-agent LLMs. This paper introduces a method that enhances reasoning through knowledge integration, improving the accuracy of zero-shot diagnosis prediction using multi-agent LLMs. This technique is crucial for handling cases where limited data is available.

Chain-of-thought (CoT) methods are also being explored to improve the reasoning process. V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis presents a vision-to-text CoT approach for medical reasoning and diagnosis, allowing LLMs to generate step-by-step reasoning from visual inputs. This is particularly useful in radiology and other image-based diagnostic fields. The importance of disentangling reasoning and knowledge is further emphasized in Disentangling Reasoning and Knowledge in Medical Large Language Models, which aims to understand how these two components contribute to the performance of Medical LLMs.

Self-correction and reflection are key aspects of improving reasoning capabilities. Med-REFL: Medical Reasoning Enhancement via Self-Corrected Fine-grained Reflection introduces a method for enhancing medical reasoning through self-corrected fine-grained reflection, allowing LLMs to refine their reasoning processes iteratively. Multimodal LLMs are also being developed to leverage both textual and visual data, as seen in MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis, which presents a multimodal LLM for medical reasoning and diagnosis.

The step-by-step reasoning process is critical for transparency and verifiability in medical applications. Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs focuses on enhancing the step-by-step and verifiable nature of medical reasoning in multimodal LLMs. This is essential for building trust in LLM-driven medical decisions. Reinforcement learning is being used to incentivize unified medical reasoning, as demonstrated in Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning, which aims to improve the consistency and accuracy of LLM reasoning through reinforcement learning.
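The value of verifiable step-by-step reasoning can be shown with a toy checker: if each reasoning step is emitted in a machine-checkable form, a verifier can pinpoint the first wrong step instead of trusting the final answer. The arithmetic step format below is a hypothetical stand-in for the clinical reasoning steps these papers target:

```python
# Toy verifier for step-by-step reasoning: each step is an arithmetic claim
# "a op b = c", and the checker validates every step so a wrong intermediate
# result is caught rather than silently propagated.

import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
STEP_RE = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")

def verify_steps(steps: list[str]) -> int:
    """Return the index of the first invalid step, or -1 if all check out."""
    for i, step in enumerate(steps):
        m = STEP_RE.fullmatch(step.strip())
        if m is None:
            return i  # unparseable step counts as a failure
        a, op, b, claimed = m.group(1), m.group(2), m.group(3), m.group(4)
        if OPS[op](int(a), int(b)) != int(claimed):
            return i
    return -1

print(verify_steps(["2 + 3 = 5", "5 * 4 = 20"]))  # -1 (all steps valid)
```

The same pattern underlies process-reward approaches: rewarding (or flagging) individual steps requires exactly this kind of per-step verifiability rather than outcome-only grading.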

Behavioral testing is crucial for ensuring the reliability of Medical LLMs. DeVisE: Behavioral Testing of Medical Large Language Models presents a method for testing the behavior of these models, ensuring they perform as expected in various clinical scenarios. Parameter-efficient training methods are also being explored to make Medical LLMs more accessible, as seen in Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training, which achieves state-of-the-art medical reasoning with a parameter-efficient approach.

Multi-agent collaboration is another promising direction for enhancing medical reasoning. MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning focuses on optimizing multi-agent collaboration for multimodal medical reasoning, allowing different agents to specialize and collaborate on complex cases. The use of LLMs to reason over BM25 scores is explored in InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking, demonstrating their potential in improving information retrieval tasks.
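The score-aware reranking pattern can be sketched with a small from-scratch BM25 scorer whose scores are then surfaced to the LLM inside the rerank prompt. The corpus, prompt wording, and parameter defaults below are illustrative assumptions, not InsertRank's exact setup:

```python
# Minimal Okapi BM25 plus a rerank prompt that exposes the scores to an LLM,
# illustrating the general idea of letting the model reason over lexical
# relevance signals during listwise reranking.

import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Standard Okapi BM25 with whitespace tokenization (toy-scale)."""
    toks = [d.lower().split() for d in docs]
    avgdl = sum(map(len, toks)) / len(toks)
    n = len(docs)

    def idf(term: str) -> float:
        df = sum(term in d for d in toks)
        return math.log((n - df + 0.5) / (df + 0.5) + 1)

    scores = []
    for d in toks:
        tf = Counter(d)
        scores.append(sum(
            idf(t) * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            for t in query.lower().split() if tf[t]
        ))
    return scores

def rerank_prompt(query: str, docs: list[str]) -> str:
    """Build a listwise rerank prompt that includes each passage's BM25 score."""
    scored = zip(docs, bm25_scores(query, docs))
    lines = [f"[{i}] (BM25={s:.2f}) {d}" for i, (d, s) in enumerate(scored)]
    return f"Query: {query}\nRe-rank these passages by relevance:\n" + "\n".join(lines)
```

Including the scores gives the LLM a lexical-relevance prior to reason over, which it can then override when semantic relevance and term overlap disagree.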

Guideline-verified process rewards are being used to enhance the reasoning process, as presented in Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards. This approach uses medical guidelines to reward accurate reasoning steps, improving the overall quality of the reasoning process. A generalist foundation model for unified multimodal medical understanding and reasoning is introduced in Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning, providing a comprehensive solution for various medical tasks.

Datasets are crucial for training and evaluating Medical LLMs. Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy presents a multimodal dataset for medical reasoning in gastrointestinal endoscopy. Finally, the generation of datasets using multi-agent systems is explored in ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning, demonstrating how multi-agent systems can be used to create large-scale datasets for advancing medical reasoning.

Title Date Comment
KERAP: A Knowledge-Enhanced Reasoning Approach for Accurate Zero-shot Diagnosis Prediction Using Multi-agent LLMs 2025-07-03
V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis 2025-06-27 12 pages, 4 figures
Disentangling Reasoning and Knowledge in Medical Large Language Models 2025-06-24
Med-REFL: Medical Reasoning Enhancement via Self-Corrected Fine-grained Reflection 2025-06-23
MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis 2025-06-23
Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs 2025-06-20
Med-U1: Incentivizing Unified Medical Reasoning in LLMs via Large-scale Reinforcement Learning 2025-06-20
DeVisE: Behavioral Testing of Medical Large Language Models 2025-06-18
Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training 2025-06-18
MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning 2025-06-17
InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking 2025-06-17
Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards 2025-06-13
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning 2025-06-13 Technical Report, 53 pages, 25 tables, and 16 figures. Our webpage is https://alibaba-damo-academy.github.io/lingshu/
Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy 2025-06-11
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning 2025-06-11 24 pages, 6 figures, 7 tables