Nikita Haduong♠ Irene Wang Bo-Ru Lu♠
Prithviraj Ammanabrolu♣Noah A. Smith♠♢
♠University of Washington ♣University of California, San Diego ♢Allen Institute for AI
{qu,nasmith}@cs.washington.edu roylu@washington.edu prithvi@ucsd.edu
Abstract
Teams can outperform individuals; could adding AI teammates further bolster performance of teams solving problems collaboratively? Collaborative problem solving (CPS) research commonly studies teams with two agents (human-human or human-AI), but team research literature finds that, for complex tasks, larger teams are more effective. Progress in studying collaboration with more than two agents, through textual records of team interactions, is hindered by a major data challenge: available CPS corpora are predominantly dyadic, and adapting pre-existing CPS tasks to more agents is non-trivial. We address this data challenge by developing a CPS task generator, CPS-TaskForge, that can produce environments for studying CPS under a wide array of conditions, and releasing a CPS task design checklist grounded in the theoretical PISA 2015 CPS framework to help facilitate the development of CPS corpora with more agents. CPS-TaskForge takes the form of a resource management (tower defense) game, and different CPS tasks can be studied by manipulating game design parameters. We conduct a case study with groups of 3–4 humans to validate production of diverse natural language CPS communication in a game instance produced by CPS-TaskForge. We discuss opportunities for advancing research in CPS (both with human-only and human-AI teams) using different task configurations. We will release data and code.
CPS-TaskForge: Generating Collaborative Problem Solving Environments for Diverse Communication Tasks
1 Introduction
Modern life requires teamwork to solve problems (Marks et al., 2001), but what makes a team work well together? This area of study, known as collaborative problem solving (CPS), is active across many disciplines: psychologists study the construction of team mental models in team discussions (Lee, 2015), business management scientists investigate how communication style affects performance evaluation (Proell et al., 2022), and educators develop tools to teach team communication strategies (Stewart et al., 2023). A common research direction is discovering how team members talk to one another. Conducting empirical work in CPS faces many challenges, in large part because the CPS task design space is large (e.g., what is the problem, who makes up the team, and who knows what information when). As a result, despite extensive interdisciplinary work in CPS, task designs in empirical studies have often focused on teams of two collaborating to solve problems such as selecting a designated object, modeling search and rescue, and making decisions.
AI agents have the potential to increase team effectiveness, and developing ways to integrate AI into teams is an active area of research in communities such as HCI (Cai et al., 2019), NLP (Bansal et al., 2019; Vats et al., 2024), and AI fairness (Lai et al., 2021). Example integrations include AI-assisted decision making with one human and one AI (e.g., cancer diagnosis; Chen et al., 2021) and AI-assisted creative tooling (e.g., Tsiros and Palladini, 2020; Lu et al., 2024a). Developing these collaborative tools is made possible through open datasets. For example, various Amazon reviews datasets (e.g., Fornaciari and Poesio, 2014; Ni et al., 2019) have been used to develop sentiment classifiers and deception detectors that can be used as AI-assisted decision makers, and the Reddit WritingPrompts dataset (Fan et al., 2018) has been valuable in developing co-writing AI systems. Unfortunately, a paucity of open datasets with more than two parties makes it challenging to integrate AI with larger human teams, as we lack understanding of team dynamics when an AI communicates with a team rather than an individual.
To support CPS study across different designs (e.g., adding a third AI teammate to a two-human team or using voice instead of text communication), we introduce a CPS task environment generator, CPS-TaskForge. CPS-TaskForge instantiates a resource management activity through a tower defense game and supports adjusting a range of CPS design parameters such as team composition, communication method, and how stressful the task is. In a tower defense game, players must defend their base by using limited resources to construct towers that can defeat enemies before the enemies destroy the base. We provide a CPS task design checklist, CPS-✓, adapted from the PISA 2015 theoretical CPS framework (PISA2015) developed by the OECD (OECD, 2017), to support generating the desired task environment with CPS-TaskForge.
We illustrate CPS-TaskForge capabilities by presenting several CPS task designs and conducting a case study that collects human communication data exhibiting a range of CPS skills, including social skills such as maintaining group communication and cognitive skills such as developing strategic plans. In our study, small groups of 3–4 participants complete a task multiple times with increasing difficulty. We observe many different successful strategies and a wide range in CPS skill usage across teams, demonstrating the versatility of collecting data through CPS-TaskForge.
To summarize our contributions:
1. We identify opportunities and gaps in the interdisciplinary CPS literature. We argue that human team research can help advance human-AI team design; however, there exist challenges associated with the lack of diverse CPS data available to the research community.
2. We introduce CPS-TaskForge, which allows researchers to generate a variety of CPS task environments for studying human and human-AI CPS team processes. We adapt a theoretical CPS framework into a design checklist, CPS-✓, to assist with CPS-TaskForge environment generation.
3. We present a case study using CPS-TaskForge to illustrate the variability of CPS data through a study with more than two agents. We release the conversation and survey data collected during the study as an example of what can be produced using CPS-TaskForge.
2 Collaboration and Problem Solving
Dataset or Study | Task Type | Team Size | Communication Modality |
KTH Tangrams (Shore et al., 2018) | Object Identification | 2 | Speech |
PentoRef (Zarrieß et al., 2016) | Object Identification | 2 | Multimodal |
TEAMS (Rockenbach et al., 2007) | Forbidden Island™ | 3–4 | Multimodal |
ASIST (Huang et al., 2022) | Search and Rescue | 3 | Multimodal |
CerealBar (Suhr et al., 2019) | Search and Rescue | 2 | Text |
HCRC Map Task (Anderson et al., 1991) | Search and Rescue | 2 | Speech |
PhotoBook (Takmaz et al., 2020) | Object Identification | 2 | Text |
Cards (Potts, 2012) | Search and Rescue | 2 | Text |
Rodrigues et al. (2021) | Object Identification | 2 | Multimodal |
Ma et al. (2023) | Programming | 2 | Multimodal |
Butchibabu et al. (2016) | Search and Deliver | 2 | Text |
Kokel et al. (2022) | Object Construction | 2 | Multimodal |
∙ MRE (Hill et al., 2003) | Decision Making | 21 | Speech |
T-shirt Task (Andrews et al., 2019) | Math Problem | 2 | Multimodal |
Volcano Lab (Flor et al., 2016) | Science Lab | 2 | Text |
Circuit Lab (Graesser et al., 2018) | Science Lab | 3 | Text |
Physics Playground (Sun et al., 2020) | 2D Physics Puzzles | 3 | Multimodal |
Minecraft (Sun et al., 2020) | Minecraft Hour of Code | 3 | Multimodal |
CPSCoach (Stewart et al., 2023) | 2D Physics Puzzles | 2 | Multimodal |
∙ NeoCities (Schelble et al., 2022) | Search and Rescue | 3 | Text |
9-11 Firefighting (Hutchins et al., 2008) | Firefighting | — | Speech |
Air Warfare (Hutchins et al., 2008) | Object Identification | 6+ | Speech |
Maritime Interdiction Operations (Hutchins et al., 2008) | Object Identification | 3+ | Speech |
Wiltshire et al. (2018) | NASA Moonbase Alpha Simulation | 2 | Speech |
CPS-TaskForge (this work) | Object Identification, Resource Management | 1–4+ | Text, Speech |
Collaborative problem solving (CPS) processes are well-studied for human teams, but when human-AI teams are considered, downstream task performance has been prioritized, leaving human-AI CPS processes understudied. For example, Proell et al. (2022) found human team communication more effective when the appropriate style was used in conjunction with the delivery of relevant information. Humans have different expectations towards AI teammates (Zhang et al., 2023, 2021; Grimes et al., 2021), so human-AI teams may value communication style differently. Studying human-AI CPS processes requires developing the appropriate datasets, but resources for creating such data are lacking.
Understanding how effective and efficient communication can predict successful teamwork requires collecting data in a variety of CPS settings. The tasks used to elicit relevant data use human participants and often model real-world activities, e.g., rescuing humans from a burning building (ASIST; Corral et al., 2021; Freeman et al., 2021), instruction following through selecting designated objects (e.g., PentoRef, Zarrieß et al., 2016; KTH Tangrams, Shore et al., 2018; PhotoBook, Takmaz et al., 2020; Doll Dialogue, Tenbrink et al., 2017; Paxton et al., 2021), and navigating environments (e.g., HCRC Map Task, Anderson et al., 1991; Effenberger et al., 2021). The resulting datasets have been used to study a wide variety of communication and linguistic phenomena, including language entrainment (i.e., when communicative behavior becomes similar among interlocutors, including lexical choice and rhythm) and common ground building (i.e., when interlocutors develop their own code). To the best of our knowledge, analogous settings incorporating an AI team member in a CPS task have not explored similar communication and linguistic phenomena because only recently has AI-generated natural language become indistinguishable from human-written language (Clark et al., 2021; Dugan et al., 2022), enabling exploration of AI teammates as peers. Unfortunately, expanding pre-existing datasets to other CPS settings, such as involving an AI agent or a third human team member, is challenging because the tasks were designed to study a specific team composition; for example, what role would a third participant play in a navigation task originally designed for one human to tell another human where to go?
Despite the extensive body of literature studying CPS, publicly available resources remain scarce, particularly when more than two agents are involved. We summarize a sample of CPS task activities in the literature in Table 1 to illustrate gaps in task type and team size between studies with or without data release to the research community.
3 CPS-TaskForge and Tower Defense
To advance CPS research, we need ways to systematically study CPS when varying factors, allowing comparison of CPS results across settings. We therefore develop a CPS task environment generator, CPS-TaskForge, which can generate CPS environments with different design factors. We also release a CPS task design checklist, CPS-✓, that describes how varying design factors produces different environments. We defer discussion of CPS-✓ to Section 4; here we give a concrete description of the task environments our work targets.
We start with several requirements: (R1) CPS-TaskForge should be built on an activity that can support the different values in CPS-✓; (R2) the activity should be fun, to motivate participant signups, because CPS studies require multiple participants, making scheduling a logistical barrier to conducting CPS research; (R3) the activity should be easy to learn for both participants and researchers, in order to minimize time spent in tutorials and allow researchers to quickly design different CPS studies; and (R4) the activity should easily scale in difficulty to enable CPS research studying effects of expertise on collaboration.
We meet our design requirements by using the Tower Defense (TD) game genre as our CPS-TaskForge activity. The premise of a TD game is to defend a base from enemies by placing towers on the map, which can destroy the enemies. TD games require strategy and resource management, a vital aspect of CPS tasks (Care et al., 2015), and games have been successfully used by the research community to study communication (e.g., Codenames; Shaikh et al., 2023) and collect data (e.g., Verbosity, von Ahn et al., 2006; Duolingo, von Ahn, 2013; SearchWar, Law et al., 2009; and MatchIn, Hacker and von Ahn, 2009).
TD games are known for having a gentle learning curve, short levels (R3), and ease in scaling difficulty through simple designs (R1, R4; Avery et al., 2011). The 2021 mobile market value for TD games was estimated at 940 million USD (Analytica, 2022); this popularity suggests the potential for participants to play the game of their own volition (R2). The genre is also known to support 1–4 players in cooperative play (e.g., Bloons TD 6™ is a commercial game with a 4-player cooperative mode), natively supporting the study of human-AI teams involving as few as one human.
We briefly describe what a TD game involves, referencing an in-game screenshot (Figure 1) of an environment produced by CPS-TaskForge. In a TD game, the player needs to defend their base (7) from enemies by placing towers on the map whose inhabitants can attack the oncoming enemies. The enemies will appear at designated spawn points (1) and traverse the map along specific paths known to the player, allowing the player to strategize where to place towers effectively. Players must manage their resources (3) (e.g., gold and map real estate) when developing their defense strategy. Levels differ in the enemy spawning behavior (e.g., enemies can spawn without a break, or there is time in between groups of enemies), enemy variants (e.g., a faster or slower enemy), map terrain (e.g., obstacles can prevent tower placement), and player resources (e.g., types of towers, amount of starting gold). The standard TD game has two phases: planning, a static phase where players can place towers on the map, and attack, a dynamic phase during which enemies spawn, and players can react to the changing situation by adjusting their towers.
CPS-TaskForge is built on the open-source Godot game engine (https://godotengine.org), and further details of implementation and the tower defense games it produces are deferred to the documentation of our open-source release, as they are not essential to understanding the research contributions.
What are we studying? E.g., decision making, collaborative learning, negotiation, exploratory group work, how stress affects communication | | |
Context | Dimension | Example Values |
Problem Scenario | Q1. How is the task evaluated for success? | Binary win/lose, score(time, health) |
 | ⋆Q2. How long does one CPS instance take to complete? | 1 minute for planning and 1 minute for attack |
 | ⋆Q3. How do skill and expertise scale with repetition? | Levels of similar difficulty are repeated, level difficulty scales by introducing more enemy spawn points |
Team composition | ⋆Q4. What fraction of teammates are human or AI? | H-H-H, H-AI, H-AI-AI, H-H-AI |
 | Q5. What is the symmetry of roles? | 2 players have the same support towers, and 1 has all offense towers |
 | Q6. How are teammates interdependent? | Support towers are necessary to beat the level |
Task characteristics | Q7. How open is the solution space? | Only 1 tower placement configuration can win |
 | Q8. What information is available, and how is new information distributed (if applicable)? | All players have the same information at all times, players must discover enemy spawn sequence |
 | Q9. How much stress are players under? | No stress (unlimited planning time) |
Medium | Q10. What is the communication medium? | Text, voice |
4 CPS-✓: A CPS Task Design Checklist
The PISA 2015 CPS Framework (PISA2015) (OECD, 2017) describes CPS tasks through a set of 15 design factors, showing how different CPS settings can be studied by manipulating different combinations of factors (e.g., team size and composition). To operationalize CPS research goals as design parameters that CPS-TaskForge can use to generate the environment, we define CPS-✓, a design checklist adapted from PISA2015 (Table 2). We provide default values for CPS-✓ items in the event that some items are unnecessary to adjust for a particular study. We next explore how different hypothetical research goals can be targeted with different TD games generated by CPS-TaskForge and designed with the help of completing CPS-✓.
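To make the idea of a checklist acting as design parameters concrete, the following is a minimal sketch of how the ten CPS-✓ items in Table 2 might be recorded as one configuration object. The class name, field names, and default values are all hypothetical placeholders chosen for illustration; they are not the released CPS-TaskForge interface, and the defaults are not the defaults the checklist itself provides.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class CPSChecklist:
    """Hypothetical record of the ten CPS-✓ items; not the released CPS-TaskForge API."""

    evaluation: str = "binary"                # Q1: "binary" or "score"
    planning_seconds: int = 60                # Q2: length of the planning phase
    attack_seconds: int = 60                  # Q2: length of the attack phase
    difficulty_scaling: str = "repeat"        # Q3: "repeat" or "more_spawn_points"
    team: Tuple[str, ...] = ("H", "H", "H")   # Q4: "H" = human, "AI" = AI teammate
    symmetric_roles: bool = True              # Q5: do all players hold the same towers?
    interdependent_towers: bool = False       # Q6: are support towers required to win?
    single_solution: bool = False             # Q7: is only one placement able to win?
    shared_information: bool = True           # Q8: do all players see the same state?
    stress: str = "none"                      # Q9: e.g., "none", "moderate", "high"
    media: Tuple[str, ...] = ("text",)        # Q10: any of "text", "voice"

    def __post_init__(self) -> None:
        # Reject values outside the illustrative vocabularies used above.
        if not self.team or any(m not in ("H", "AI") for m in self.team):
            raise ValueError("team must be a non-empty tuple of 'H' and 'AI'")
        if not set(self.media) <= {"text", "voice"}:
            raise ValueError("media must be drawn from 'text' and 'voice'")

    def human_fraction(self) -> float:
        """Fraction of teammates that are human (Q4)."""
        return self.team.count("H") / len(self.team)


# Two study conditions that differ only in team composition (Q4).
all_human = CPSChecklist(evaluation="score")
mixed = replace(all_human, team=("H", "H", "AI"))
```

Keeping every item in a single immutable record means two conditions can be derived from one another by changing exactly one field, which mirrors the controlled comparisons the checklist is meant to support.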
Goal: Compare solution quality between all-human teams and mixed human-AI teams.
To compare solution quality, we require a more complex task evaluation function than a simple binary win/lose value (Q1). We can design a scoring function to incorporate the time required to agree on a strategy during the planning phase, the amount of money used, or the distance enemies travel. We can also adjust the solution space size (Q7). A level can have a single solution, requiring a specific strategy for placing towers, in which case solution quality is evaluated by how quickly the team finds the solution. A level can also have multiple solutions, with solutions rated for quality; e.g., a solution using the minimum number of towers is harder to achieve than a solution maximizing resource consumption and is thus higher quality. The solution quality comparison between teams can then measure the rate of solving levels with minimal resource consumption.
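One way such a scoring function could look is sketched below. It combines the three signals named above (planning time, money used, and distance enemies travel) into a single value in [0, 1]. The record fields, the equal default weights, and the rule that a lost round scores zero are assumptions made for illustration, not the scoring function used in CPS-TaskForge.

```python
from dataclasses import dataclass


@dataclass
class RoundOutcome:
    """Hypothetical summary of one played round."""

    won: bool
    planning_seconds_used: float
    planning_seconds_allowed: float
    gold_spent: int
    gold_available: int
    enemy_path_fraction: float  # furthest any enemy got along its path, in [0, 1]


def solution_quality(outcome: RoundOutcome,
                     w_time: float = 1 / 3,
                     w_gold: float = 1 / 3,
                     w_distance: float = 1 / 3) -> float:
    """Weighted score in [0, 1]; higher is better. A lost round scores 0 (assumption)."""
    if not outcome.won:
        return 0.0
    # Each component is the unused share of a budget, so frugal and fast teams score higher.
    time_saved = 1.0 - outcome.planning_seconds_used / outcome.planning_seconds_allowed
    gold_saved = 1.0 - outcome.gold_spent / outcome.gold_available
    distance_held = 1.0 - outcome.enemy_path_fraction
    return w_time * time_saved + w_gold * gold_saved + w_distance * distance_held


# Two winning rounds that differ only in spending, plus one lost round.
frugal = RoundOutcome(True, 45, 60, 2_000, 10_000, 0.5)
lavish = RoundOutcome(True, 45, 60, 9_000, 10_000, 0.5)
lost = RoundOutcome(False, 10, 60, 1_000, 10_000, 1.0)
```

Setting `w_gold` to 1 and the other weights to 0 would reduce this to a pure minimal-expenditure score, which is closer to the resource-consumption comparison described above.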
We want to use team compositions with different fractions of human and AI players (Q4). We can investigate how different team roles and personalities in all-human or mixed human-AI teams affect solution quality (Q5); for example, an all-human team where everyone identifies as a leader and has the same towers could result in poor solution quality due to an increase in conflict over strategy; or a team where a human leader effectively uses support towers from an AI teammate (Q6) may outperform a team with an AI leader who does not request support towers from a human teammate. Since we are interested in manipulating team composition, we can give all players a shared resource pool so that information is updated and distributed to all players simultaneously (Q8).
Goal: Investigate how stress affects team performance and communication.
Stress can affect team performance, learning, and communication (Pfaff, 2012; Savelsbergh et al., 2012; Orasanu et al., 2004), with more successful teams developing adaptive strategies (Kontogiannis and Kossiavelou, 1999). We can model stressful situations by adjusting the amount of starting resources (money and planning time) to require more dynamic gameplay during the attack phase, forcing players to adapt to a rapidly changing environment (Q9). To design levels requiring more dynamic gameplay, we limit the initial starting resources such that players cannot beat a level by only placing towers during the planning phase. As enemies are defeated, players gain additional gold to spend towards placing more towers and upgrading existing towers, which are required to successfully defend their base. The control condition can then be giving players plentiful starting resources. We will evaluate the task with a simple binary win/lose (Q1) and allow several possible solutions so that teams are not discouraged if they cannot land on the single most optimal solution (Q7). Giving less money and planning time means players have to monitor the changing situation during the attack phase. We enable voice communication (Q10) so that typing speed is not a factor.
Goal: Reimplement and extend prior work.
Although CPS-TaskForge is designed to generate TD games, we can simulate object selection and manipulation tasks by limiting player interaction.
Object Selection. Reference games used in KTH Tangrams (Shore et al., 2018) and PentoRef-Take (Zarrieß et al., 2016) are played with two players in the roles Instruction Giver (IG) and Instruction Follower (IF). Both players have a view of the map. The IG is given the game goal (select a specific piece), and the IF can manipulate the map (select the piece). We simulate this task using CPS-TaskForge by designing levels with towers placed on the board at the start, replacing the tower imagery with a pentomino or tangram. We enable voice communication and end the level when a tower is selected, evaluating success as selecting the correct tower (Q1).
Object Manipulation. Tenbrink et al. (2017) designed a task for furnishing a physical dollhouse. The IG is given the furnished dollhouse, and the IF is given an empty house. The IG needs to instruct the IF to furnish the house, and task success is evaluated by the correctness of object location and orientation. To simulate this task in CPS-TaskForge, we design levels that resemble house interiors, with walls designating rooms and preventing towers from being placed on them. We give the IF a set of towers that can be placed in the level, replacing the tower imagery with furniture. A tower can span multiple grid spaces on the map, and there are multiple copies of each tower with different orientations. The IG is provided the same level but with towers placed on the map already (similar to the setup for the reference games). Voice chat is enabled for communication. Since CPS-TaskForge produces digital grid-based games, object location and orientation can be automatically evaluated for correctness, improving upon the original setting, where evaluation was manually coded. A limitation of our simulation is that the original task used a physical dollhouse, giving participants multiple perspectives of the board (which could increase task complexity), while our simulation only gives players a single top-down view. 3D simulations or creating multiple 2D perspectives could be explored in future work.
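The automatic evaluation of location and orientation mentioned above can be illustrated with a short sketch. It assumes each placed piece is identified by the grid cell of its anchor, a piece name, and an orientation in degrees; this representation and the function name are hypothetical and are not taken from the CPS-TaskForge code.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]        # (column, row) of a piece's anchor cell
Placement = Tuple[str, int]   # (piece name, orientation in degrees)


def evaluate_furnishing(target: Dict[Cell, Placement],
                        attempt: Dict[Cell, Placement]) -> Dict[str, float]:
    """Compare the follower's layout against the giver's target layout."""
    if not target:
        raise ValueError("target layout must contain at least one piece")
    right_location = 0
    right_location_and_orientation = 0
    for cell, (piece, orientation) in target.items():
        placed = attempt.get(cell)
        if placed is not None and placed[0] == piece:
            right_location += 1
            if placed[1] == orientation:
                right_location_and_orientation += 1
    return {
        "location_accuracy": right_location / len(target),
        "placement_accuracy": right_location_and_orientation / len(target),
        # Pieces placed on cells that the target layout leaves empty.
        "extra_pieces": float(len(set(attempt) - set(target))),
    }


# Toy layouts: the sofa is rotated incorrectly and a stray lamp was added.
target_layout = {(1, 1): ("bed", 0), (4, 2): ("sofa", 90)}
attempt_layout = {(1, 1): ("bed", 0), (4, 2): ("sofa", 0), (6, 6): ("lamp", 0)}
result = evaluate_furnishing(target_layout, attempt_layout)
```

Reporting location and orientation separately keeps the two correctness criteria from the original task distinguishable in the output.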
5 Case Study: Communication of Small Groups as Task Difficulty Increases
To validate the flexibility of CPS-TaskForge, we explore whether it can produce an environment that elicits diverse collaborative problem solving behavior. Prior work in CPS primarily used tasks with dyads or task repetitions at the same difficulty level, so we design a CPS task where teams of 3–4 people complete a task, aiming to minimize expenditure of gold, at multiple difficulty levels.
We design our CPS-TaskForge environment as follows. Task success is evaluated by the amount of money left unused, enemies destroyed, and health of the base (CPS-✓, Q1). A single level takes 5–8 minutes to complete, depending on level difficulty, and we design 3 levels with increasing difficulty (Appendix 3(a); Q2–3). All players are human (Q4), and each player is given 2–4 unique towers from a pool of 12 towers with different properties (Appendix A) so that players have different roles, encouraging all players to engage and suggest usage of their own towers (Q5–6). Players are provided a surplus of gold, and costs are balanced to slightly favor upgrading over placing more towers, giving teams the opportunity to find many successful strategies (Q7). All new information is distributed to players simultaneously (e.g., how much damage an enemy receives from a tower) (Q8). Players are under moderate time stress because each level is calibrated to give ample time to discuss strategy and place towers, and we disabled interaction during the attack phase (Q9). We designated level-specific planning time to ensure the study is completed in a reasonable amount of time. Players can only communicate through text chat (Q10). These design decisions showcase how simply the TD genre allows different CPS task environments to be created.
5.1 Data Collection
Twelve teams of 3–4 people (42 individuals in total) were recruited to participate in a 1.5-hour study, which was approved by our local IRB, and were compensated with a gift card at a rate of 20 USD/hour. The study was conducted both in-person and remotely, and all sessions were moderated. Recruitment occurred through school email listings and paper flyers posted around town. Participants were aged 18–24 (72%), 25–31 (18%), and 32+ (10%); 55% of participants were current undergraduates and 36% were in a graduate degree program; a third of participants rated their tower defense game familiarity below 3 on a 5-point Likert scale. Familiarity between teammates was not controlled, allowing some team compositions to contain strangers and others a subset of friends.
The study began with individual pre-surveys collecting basic demographic information, then participants watched a tutorial video explaining how to play the game and played a simple tutorial level together to become familiar with the interface. After the tutorial, they were given time to ask any questions about how to play the game. They then played 3 different levels 3 times each for a total of 9 games. Levels increased in difficulty, but the three rounds were the same for each level. Finally, they completed individual post-surveys containing questions about teamwork quality, team role identity, and team communication.
Dimension | CPS Skill | Example | Count | Avg. Tokens |
---|---|---|---|---|
Social | Maintaining communication | “haha okay” | 222 | 2.3 |
 | Sharing information | “I have a tower damange all enemies” | 114 | 7.0 |
 | Establishing shared understanding | “what does the diamond tower do?” | 67 | 5.4 |
 | Negotiating | “do we want to risk getting rid of anything else?” | 38 | 5.4 |
Cognitive | Representing and formulating | “fires in multiple directions” | 105 | 9.3 |
 | Planning | “ok we can chokepoint the corners” | 227 | 7.2 |
 | Executing Actions | “k i maxed [upgrades]” | 42 | 5.9 |
 | Monitoring | “50 seconds D:” | 86 | 5.0 |
The first 4 teams were used to calibrate game difficulty and level designs, and the final dataset contains 7 teams producing 4k utterances with a vocabulary size of 1.2k (Appendix Table 4). The average win rate in the final round for levels 1–3 was 100%, 71%, and 71%.
5.2 Observations
We adapt a CPS skill taxonomy developed by Andrews et al. (2019) to describe the communication data, simplifying the initial 10-skill taxonomy to 8 skills because of low annotation reliability (Table 3); we discuss annotation challenges in Appendix Subsection D.1. We label only explicit natural language communications; the original taxonomy also includes system interactions (e.g., the act of placing a tower could be classified as “executing action”). A sample of 45 utterances was manually annotated by two authors (inter-annotator agreement of 73%), then one author annotated all games for 3 teams (19% of the data). Example team communication is in Appendix Table 5, exemplifying planning and directing through natural language, as well as communication through game behavior (e.g., placing a tower at a specified location when requested without using language to acknowledge the request).
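For readers who want to compute agreement on their own annotations, the sketch below shows two common measures for two annotators: raw percent agreement and Cohen's kappa, which corrects for chance agreement. The text above does not state which metric underlies the 73% figure, so this is an illustration under that uncertainty rather than a reproduction; the label sequences are toy data, not the study's annotations.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of items on which two annotators chose the same label."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = counts_a.keys() | counts_b.keys()
    # Expected agreement if both annotators labeled independently at their own base rates.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels drawn from the skill names in Table 3.
annotator_1 = ["Planning", "Planning", "Monitoring", "Negotiating"]
annotator_2 = ["Planning", "Monitoring", "Monitoring", "Negotiating"]
raw = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
```

With an 8-category taxonomy, kappa is typically lower than raw agreement, so stating which measure is reported matters when comparing across studies.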
Cognitive CPS skills were used 49% of the time, and 29% of all communication was devoted to developing strategic plans (planning and negotiation skills). Andrews et al. (2019) observed 30% cognitive skill usage using a traditional collaborative math task, suggesting that the TD task in CPS-TaskForge is a viable task for CPS studies.
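Usage shares of this kind can be derived directly from per-skill counts. The sketch below uses the counts listed in Table 3 and assumes, as a simplification, that shares are computed over the sum of those counts; under that assumption it recovers the share of communication devoted to strategic plans (planning plus negotiating) and the share for negotiating alone. The variable and function names are illustrative.

```python
# Annotated utterance counts per CPS skill, copied from Table 3.
skill_counts = {
    "Maintaining communication": 222,
    "Sharing information": 114,
    "Establishing shared understanding": 67,
    "Negotiating": 38,
    "Representing and formulating": 105,
    "Planning": 227,
    "Executing actions": 42,
    "Monitoring": 86,
}

total_utterances = sum(skill_counts.values())


def usage_share(*skills: str) -> float:
    """Fraction of annotated utterances labeled with any of the given skills."""
    return sum(skill_counts[s] for s in skills) / total_utterances


strategic_plan_share = usage_share("Planning", "Negotiating")
negotiating_share = usage_share("Negotiating")

print(f"strategic plans: {strategic_plan_share:.0%}")
print(f"negotiating:     {negotiating_share:.0%}")
```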
From the surveys, we saw that the game was positively received, supporting our objective of developing a fun CPS activity (R2). 43% of players commented that the game was fun, three players requested an official game release to play with others, and no player complained about task tedium.
5.3 Analysis
Our levels were designed to give players a wide solution space through having an abundance of gold; when teams won in the final round of every level, the remaining gold had standard deviations of 5.5k, 9.3k, and 10.6k in levels 1–3, respectively. This design emphasized problem space exploration over negotiating for a single optimal solution and is reflected in the low “negotiation” skill usage (4%) and high spread of placed towers (Appendix 3(b)). Figure 2 shows an example of two teams solving level 2 with different strategies in tower placement and quantity. One team chose to concentrate their towers where the two paths meet so that towers can attack enemies on both routes, while another team placed many towers across the whole map. Our scoring function emphasized minimizing expenditure, so the strategy in Figure 2(a) received a higher score than the one in Figure 2(b). Rounds were repeated three times, allowing teams to optimize working solutions; however, teams did not learn to significantly change expenditure behavior, which suggests cautious game behavior (Appendix Figure 4). Teams 1 and 5 appeared to be confused about the task goal, often spending more money across rounds despite winning a previous round.
6 Related Work
Prior work in CPS has studied a range of factors to understand effective teams, from identifying the effects of team member personalities to determining how teamwork processes can be evaluated. When an AI teammate is involved, an important research direction investigates how and why humans choose to rely on AI. Findings from CPS human team processes can lead to improvements in AI agents and to discovering how to better integrate AI into human teams to solve more complex problems.
Researchers have investigated how team composition affects human team outcomes (e.g., Ruch et al., 2018; Mathieu et al., 2014; Bell et al., 2018; Hollenbeck et al., 2004, inter alia), discovering predictors of team outcomes through team roles, individual expertise, demographics, and team knowledge. Lykourentzou et al. (2016) found five-person teams with balanced personalities outperformed those with an imbalance in personalities on collaborative tasks. Analogously, Wang et al. (2023) and Fan et al. (2024) were able to improve LM performance on downstream tasks by instructing the LM to simulate teams of domain-specific personas to collaborate internally. Priming an LM agent with a persona enables the simulation of inherited knowledge and linguistic patterns (Masumura et al., 2018; Wei et al., 2023; Park et al., 2023), and searching for optimal personas in human-AI teams could lead to improvements in human-AI team performance.
CPS tasks can be evaluated for overall task success, but improving teamwork requires evaluating intermediate processes. Pavez et al. (2022) analyzed over a hundred studies on team performance measurement to propose a framework for evaluating teamwork along 4 dimensions: project team processes, project team emergent states, project team tangible outcomes, and project team perceptual benefits. Educators have classified CPS communication for CPS skill usage to provide feedback to students on how to improve their group communication (Andrews et al., 2019; Graesser et al., 2018; Flor et al., 2016; Stewart et al., 2023). Despite extensive work in evaluating CPS teams, there is little data released to the research community.
The increased deployment of AI in high-stakes collaborative environments, e.g., the medical domain, has opened questions about how an AI collaborator affects human behavior, leading to work in trust and reliability of AI. Humans are known to overrely on AI, following AI suggestions even when they are wrong (Lai and Tan, 2019; Jacobs et al., 2021; Bussone et al., 2015). Gazit et al. (2023), Mesbah et al. (2021), and Lu et al. (2024b) designed studies to understand human (over)reliance on AI using “judge-advisor system” tasks where a human or AI advisor provides advice to a human judge, and the judge is responsible for making the final decision. However, decisions in these tasks are independent, and the judges are not able to explain their reasoning to the advisor in a bid to adjust the advisor’s position, preventing the study of longer-term effects of human-AI interactions and human-AI communication.
7 Conclusion
Human-AI collaborative problem solving tools are rapidly being integrated in real-world work environments. The modern workforce uses teams with more than two parties, but empirical research with larger teams lags behind. The task design space for conducting CPS research is large, and the tooling to systematically explore CPS designs is lacking. Our CPS task environment generator, CPS-TaskForge, enables diverse, systematic CPS research through a tower defense game environment that appeals to human subjects and is grounded in theory.
We will release all code for CPS-TaskForge and communication data collected in our case study to encourage studying multi-human and multi-AI collaborative problem solving.
8 Limitations
The tower defense task in CPS-TaskForge environments has a learning curve (albeit a gentle one), so tutorials and practice before the actual study commences may take longer than for simpler tasks such as a reference game. This complexity is necessary to support a broad range of complex tasks. CPS-TaskForge environments currently only support a top-down perspective of the world, so supporting first-person settings (e.g., simulating a Minecraft search and rescue task) is infeasible. We believe these design limitations can encourage the development of other similarly specialized CPS environment generators.
Our initial release of CPS-TaskForge implements many common attributes of tower defense games. Many more attributes that have been successfully deployed in commercial tower defense games remain available for implementation and may benefit future CPS studies, such as increasing the task difficulty by giving enemies resistance to certain towers. We hope to see CPS-TaskForge evolve in its feature set through usage.
Although CPS-TaskForge was developed in English, and our case study used English, usage of CPS-TaskForge does not require English. CPS-TaskForge was built in the open-source game engine Godot, which natively supports other languages and localization.
CPS-✓ is adapted from PISA 2015, but the CPS researcher may find other CPS frameworks (e.g., ATSC21, Hesse et al., 2015, and the generalized competency model by Sun et al., 2020) more appropriate as a checklist. We expect that adapting other frameworks into a checklist that can be used to generate CPS-TaskForge environments should not be a major challenge, as other frameworks describe CPS tasks using different attributes, and the TD game used in CPS-TaskForge is fundamentally a CPS task.
9 Ethical Considerations
The flexibility in designing CPS task environments through CPS-TaskForge necessarily places a large responsibility on the designer to design studies appropriate for their target audience or research goal. For example, the imagery used in-game for enemies and towers could be offensive to certain audiences and should be adapted as needed. As with any study in communication, appropriate content filter measures should be in place as required.
The development of generative AI agents as peers that can communicate with humans comes with the risks of the AI agents generating inappropriate content and the concerns of AI replacing humans. Our intentions are that the AI agents can augment human capabilities in more complex problem solving situations, boosting CPS abilities; however, we acknowledge that some problem solving tasks can be simulated and solved through internal or multi-agent collaboration.
Our study was approved by our institution’s IRB, and participants were fairly compensated and consented to data sharing with the research community.
References
- Analytica (2022) Astute Analytica. 2022. Mobile tower defense games market - industry dynamics, market size, and opportunity forecast to 2030. https://www.astuteanalytica.com/industry-report/mobile-tower-defense-games-market. Accessed: 2 June 2024.
- Anderson etal. (1991)AnneH Anderson, Miles Bader, EllenGurman Bard, Elizabeth Boyle, Gwyneth Doherty, Simon Garrod, Stephen Isard, Jacqueline Kowtko, Jan McAllister, Jim Miller, etal. 1991.The hcrc map task corpus.Language and speech, 34(4):351–366.
- Andrews etal. (2019)JessicaJ. Andrews, Tanner Jackson, and Christopher Kurzum. 2019.Collaborative problem solving assessment in an online mathematics task.ETS Research Report Series, pages 1–7.
- Avery etal. (2011)Phillipa Avery, Julian Togelius, Elvis Alistar, and RobertPieter VanLeeuwen. 2011.Computational intelligence and tower defence games.In 2011 IEEE Congress of Evolutionary Computation (CEC), pages 1084–1091. IEEE.
- Bansal etal. (2019)Gagan Bansal, Besmira Nushi, Ece Kamar, WalterS. Lasecki, DanielS. Weld, and Eric Horvitz. 2019.Beyond accuracy: The role of mental models in human-ai team performance.In AAAI Conference on Human Computation & Crowdsourcing.
- Bell etal. (2018)SuzanneT Bell, ShaniqueG Brown, Anthony Colaneri, and Neal Outland. 2018.Team composition and the abcs of teamwork.American psychologist, 73(4):349.
- Bussone etal. (2015)Adrian Bussone, Simone Stumpf, and Dympna O’Sullivan. 2015.The role of explanations on trust and reliance in clinical decision support systems.In 2015 International Conference on Healthcare Informatics, pages 160–169.
- Butchibabu etal. (2016)Abhizna Butchibabu, Christopher Sparano-Huiban, Liz Sonenberg, and Julie Shah. 2016.Implicit coordination strategies for effective team communication.Human Factors, 58(4):595–610.PMID: 27113991.
- Cai et al. (2019) Carrie J. Cai, Samantha Winter, David F. Steiner, Lauren Wilcox, and Michael Terry. 2019. "Hello AI": Uncovering the onboarding needs of medical practitioners for human-AI collaborative decision-making. Proceedings of the ACM on Human-Computer Interaction, 3:1–24.
- Care etal. (2015)Esther Care, Patrick Griffin, Claire Scoular, Nafisa Awwal, and Nathan Zoanetti. 2015.Collaborative problem solving tasks.Assessment and teaching of 21st century skills: Methods and approach, pages 85–104.
- Chen etal. (2021)Zi-Hang Chen, LiLin, Chen-Fei Wu, Chao-Feng Li, Rui-Hua Xu, and Ying Sun. 2021.Artificial intelligence for assisting cancer diagnosis and treatment in the era of precision medicine.Cancer Communications, 41(11):1100–1115.
- Clark etal. (2021)Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and NoahA. Smith. 2021.All that’s ‘human’ is not gold: Evaluating human evaluation of generated text.In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296, Online. Association for Computational Linguistics.
- Corral etal. (2021)ChristopherC Corral, KeerthiShrikar Tatapudi, Verica Buchanan, Lixiao Huang, and NancyJ Cooke. 2021.Building a synthetic task environment to support artificial social intelligence research.In Proceedings of the human factors and ergonomics society annual meeting, volume65, pages 660–664. SAGE Publications Sage CA: Los Angeles, CA.
- Dugan etal. (2022)Liam Dugan, Daphne Ippolito, Arun Kirubarajan, Sherry Shi, and Chris Callison-Burch. 2022.Real or fake text?: Investigating human ability to detect boundaries between human-written and machine-generated text.In AAAI Conference on Artificial Intelligence.
- Effenberger etal. (2021)Anna Effenberger, Rhia Singh, Eva Yan, Alane Suhr, and Yoav Artzi. 2021.Analysis of language change in collaborative instruction following.In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2803–2811, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Fan etal. (2018)Angela Fan, Mike Lewis, and Yann Dauphin. 2018.Hierarchical neural story generation.In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.
- Fan etal. (2024)Zhihao Fan, Jialong Tang, Wei Chen, Siyuan Wang, Zhongyu Wei, Jun Xi, Fei Huang, and Jingren Zhou. 2024.Ai hospital: Interactive evaluation and collaboration of llms as intern doctors for clinical diagnosis.arXiv preprint arXiv:2402.09742.
- Flor etal. (2016)Michael Flor, Su-Youn Yoon, Jiangang Hao, Lei Liu, and Alina von Davier. 2016.Automated classification of collaborative problem solving interactions in simulated science tasks.In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pages 31–41, San Diego, CA. Association for Computational Linguistics.
- Fornaciari and Poesio (2014)Tommaso Fornaciari and Massimo Poesio. 2014.Identifying fake Amazon reviews as learning from crowds.In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 279–287, Gothenburg, Sweden. Association for Computational Linguistics.
- Freeman et al. (2021) Jared Freeman, Lixiao Huang, Matt Wood, and Stephen J Cauffman. 2021. Evaluating artificial social intelligence in an urban search and rescue task environment. In AAAI Fall Symposium, pages 72–84. Springer.
- Gazit etal. (2023)Lior Gazit, Ofer Arazy, and Uri Hertz. 2023.Choosing between human and algorithmic advisors: The role of responsibility sharing.Computers in Human Behavior: Artificial Humans, 1(2):100009.
- Graesser etal. (2018)ArthurC. Graesser, StephenM. Fiore, Samuel Greiff, Jessica Andrews-Todd, PeterW. Foltz, and FriedrichW. Hesse. 2018.Advancing the science of collaborative problem solving.Psychological Science in the Public Interest, 19(2):59–92.PMID: 30497346.
- Grimes etal. (2021)G.Mark Grimes, RyanM. Schuetzler, and JustinScott Giboney. 2021.Mental models and expectation violations in conversational ai interactions.Decision Support Systems, 144:113515.
- Hacker and von Ahn (2009)Severin Hacker and Luis von Ahn. 2009.Matchin: eliciting user preferences with an online game.In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’09, page 1207–1216, New York, NY, USA. Association for Computing Machinery.
- Hesse etal. (2015)Friedrich Hesse, Esther Care, Juergen Buder, Kai Sassenberg, and Patrick Griffin. 2015.A framework for teachable collaborative problem solving skills.Assessment and teaching of 21st century skills: Methods and approach, pages 37–56.
- Hill etal. (2003)RandallW. Hill, J.Gratch, Stacy Marsella, Jeff Rickel, W.Swartout, and DavidR. Traum. 2003.Virtual humans in the mission rehearsal exercise system.Künstliche Intell., 17:5–.
- Hoegl and Gemuenden (2001)Martin Hoegl and HansGeorg Gemuenden. 2001.Teamwork quality and the success of innovative projects: A theoretical concept and empirical evidence.Organization science, 12(4):435–449.
- Hollenbeck etal. (2004)JohnR Hollenbeck, DScott DeRue, and Rick Guzzo. 2004.Bridging the gap between i/o research and hr practice: Improving team composition, team training, and team task design.Human Resource Management: Published in Cooperation with the School of Business Administration, The University of Michigan and in alliance with the Society of Human Resources Management, 43(4):353–366.
- Huang et al. (2022) Lixiao Huang, Jared Freeman, Nancy Cooke, Samantha Dubrow, John “JCR” Colonna-Romano, Matt Wood, Verica Buchanan, Stephen Caufman, and Xiaoyun Yin. 2022. Artificial Social Intelligence for Successful Teams (ASIST) Study 2.
- Hutchins et al. (2008) Susan G Hutchins, Anthony Kendall, and Alex Bordetsky. 2008. Understanding patterns of team collaboration employed to solve unique problems. In Proceedings of the 13th International Command and Control Research & Technology Symposium, pages 17–19.
- Jacobs etal. (2021)M.Jacobs, M.Pradier, T.McCoy, P.Roy, F.Doshi-Velez, and G.Krzysztof. 2021.How machine learning recommendations influence clinician treatment selections: example of antidepressant selection.Translational Psychiatry, 1:1–9.
- Kokel etal. (2022)Harsha Kokel, M.Das, Rakibul Islam, Julia Bonn, JonZ. Cai, Soham Dan, Anjali Narayan-Chen, Prashant Jayannavar, JanardhanRao Doppa, J.Hockenmaier, Sriraam Natarajan, Martha Palmer, and Dan Roth. 2022.Human-guided collaborative problem solving: A natural language based framework.ArXiv, abs/2207.09566.
- Kontogiannis and Kossiavelou (1999)Tom Kontogiannis and Zoe Kossiavelou. 1999.Stress and team performance: principles and challenges for intelligent decision aids.Safety science, 33(3):103–128.
- Lai etal. (2021)Vivian Lai, Chacha Chen, QingziVera Liao, Alison Smith-Renner, and Chenhao Tan. 2021.Towards a science of human-ai decision making: A survey of empirical studies.ArXiv, abs/2112.11471.
- Lai and Tan (2019)Vivian Lai and Chenhao Tan. 2019.On human predictions with explanations and predictions of machine learning models: A case study on deception detection.In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, page 29–38, New York, NY, USA. Association for Computing Machinery.
- Law etal. (2009)Edith Law, Luis von Ahn, and Tom Mitchell. 2009.Search war: a game for improving web search.In Proceedings of the ACM SIGKDD Workshop on Human Computation, HCOMP ’09, page31, New York, NY, USA. Association for Computing Machinery.
- Lee (2015)Jiwon Lee. 2015.Analysis of the refinement of shared mental model in science-gifted students’ collaborative problem solving process.Journal of the Korean Association for Research in Science Education, 35:1049–1062.
- Lu etal. (2024a)Bo-Ru Lu, Nikita Haduong, Chia-Hsuan Lee, Zeqiu Wu, Hao Cheng, Paul Koester, Jean Utke, Tao Yu, NoahA. Smith, and Mari Ostendorf. 2024a.Does collaborative human-lm dialogue generation help information extraction from human dialogues?Preprint, arXiv:2307.07047.
- Lu etal. (2024b)Zhuoran Lu, Dakuo Wang, and Ming Yin. 2024b.Does more advice help? the effects of second opinions in ai-assisted decision making.Proc. ACM Hum.-Comput. Interact., 8(CSCW1).
- Lykourentzou etal. (2016)Ioanna Lykourentzou, Angeliki Antoniou, Yannick Naudet, and StevenP. Dow. 2016.Personality matters: Balancing for personality types leads to better outcomes for crowd teams.In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, CSCW ’16, page 260–273, New York, NY, USA. Association for Computing Machinery.
- Ma etal. (2023)Yingbo Ma, Gloria AshiyaKatuka, Mehmet Celepkolu, and Kristy ElizabethBoyer. 2023.Automatically Predicting Peer Satisfaction During Collaborative Learning with Linguistic, Acoustic, and Visual Features.Journal of Educational Data Mining, 15(2).
- Marks etal. (2001)MichelleA. Marks, JohnE. Mathieu, and StephenJ. Zaccaro. 2001.A temporally based framework and taxonomy of team processes.The Academy of Management Review, 26(3):356–376.
- Masumura etal. (2018)Ryo Masumura, Tomohiro Tanaka, Atsushi Ando, Hirokazu Masataki, and Yushi Aono. 2018.Role play dialogue aware language models based on conditional hierarchical recurrent encoder-decoder.In Interspeech.
- Mathieu etal. (2014)JohnE Mathieu, ScottI Tannenbaum, JamieS Donsbach, and GeorgeM Alliger. 2014.A review and integration of team composition models: Moving toward a dynamic and temporal framework.Journal of management, 40(1):130–160.
- Mesbah etal. (2021)Neda Mesbah, Christoph Tauchert, and Peter Buxmann. 2021.Whose advice counts more - man or machine? an experimental investigation of ai-based advice utilization.In Hawaii International Conference on System Sciences.
- Ni etal. (2019)Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019.Justifying recommendations using distantly-labeled reviews and fine-grained aspects.In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197, Hong Kong, China. Association for Computational Linguistics.
- OECD (2017)OECD. 2017.PISA 2015 collaborative problem-solving framework.OECD.
- Orasanu etal. (2004)Judith Orasanu, Ute Fischer, Yuri Tada, and Norbert Kraft. 2004.Team stress and performance: Implications for long-duration space missions.In Proceedings of the human factors and ergonomics society annual meeting, volume48, pages 552–556. SAGE Publications Sage CA: Los Angeles, CA.
- Park et al. (2023) Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23), New York, NY, USA. Association for Computing Machinery.
- Pavez etal. (2022)Ignacio Pavez, Hugo Gómez, Canlong Liu, and VicenteA. González. 2022.Measuring project team performance: A review and conceptualization.International Journal of Project Management, 40(8):951–971.
- Paxton etal. (2021)Alexandra Paxton, JenniferM. Roche, Alyssa Ibarra, and MichaelK. Tanenhaus. 2021.Predictions of miscommunication in verbal communication during collaborative joint action.Journal of Speech, Language, and Hearing Research, 64(2):613–627.
- Pfaff (2012)MarkS Pfaff. 2012.Negative affect reduces team awareness: The effects of mood and stress on computer-mediated team communication.Human Factors, 54(4):560–571.
- Potts (2012)Christopher Potts. 2012.Goal-driven answers in the Cards dialogue corpus.In Proceedings of the 30th West Coast Conference on Formal Linguistics, Somerville, MA. Cascadilla Press.
- Proell etal. (2022)ChadA. Proell, Yuepin(Daniel) Zhou, and MarkW. Nelson. 2022.It’s Not Only What You Say … How Communication Style and Team Culture Affect Audit Issue Follow-Up and Auditor Performance Evaluations.The Accounting Review, 97(2):373–395.
- Rockenbach etal. (2007)Bettina Rockenbach, Abdolkarim Sadrieh, and Barbara Mathauschek. 2007.Teams take the better risks.Journal of Economic Behavior & Organization, 63(3):412–422.
- Rodrigues etal. (2021)MichelleA. Rodrigues, SiOn Yoon, Kathryn B.H. Clancy, and Elizabeth A.L. Stine-Morrow. 2021.What are friends for? the impact of friendship on communicative efficiency and cortisol response during collaborative problem solving among younger and older women.Journal of Women & Aging, 33(4):411–427.PMID: 34038325.
- Ruch etal. (2018)Willibald Ruch, Fabian Gander, Tracey Platt, and Jennifer Hofmann. 2018.Team roles: Their relationships to character strengths and job satisfaction.The Journal of Positive Psychology, 13(2):190–199.
- Savelsbergh etal. (2012)Chantal Savelsbergh, JosetteMP Gevers, BeatriceIJM Vander Heijden, and RobF Poell. 2012.Team role stress: Relationships with team learning and performance in project teams.Group & organization management, 37(1):67–100.
- Schelble etal. (2022)BeauG. Schelble, Christopher Flathmann, NathanJ. McNeese, Guo Freeman, and Rohit Mallick. 2022.Let’s think together! assessing shared mental models, performance, and trust in human-agent teams.Proc. ACM Hum.-Comput. Interact., 6(GROUP).
- Shaikh etal. (2023)Omar Shaikh, Caleb Ziems, William Held, AryanJ Pariani, Fred Morstatter, and Diyi Yang. 2023.Modeling cross-cultural pragmatic inference with codenames duet.arXiv preprint arXiv:2306.02475.
- Shore etal. (2018)Todd Shore, Theofronia Androulakaki, and Gabriel Skantze. 2018.KTH tangrams: A dataset for research on alignment and conceptual pacts in task-oriented dialogue.In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
- Stewart et al. (2023) Angela E.B. Stewart, Arjun Rao, Amanda Michaels, Chen Sun, Nicholas D. Duran, Valerie J. Shute, and Sidney K. D’Mello. 2023. Cpscoach: The design and implementation of intelligent collaborative problem solving feedback. In Artificial Intelligence in Education - 24th International Conference, AIED 2023, Proceedings, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pages 695–700, Germany. Springer Science and Business Media Deutschland GmbH.
- Suhr etal. (2019)Alane Suhr, Claudia Yan, Jack Schluger, Stanley Yu, Hadi Khader, Marwa Mouallem, Iris Zhang, and Yoav Artzi. 2019.Executing instructions in situated collaborative interactions.In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2119–2130, Hong Kong, China. Association for Computational Linguistics.
- Sun etal. (2020)Chen Sun, ValerieJ. Shute, Angela Stewart, Jade Yonehiro, Nicholas Duran, and Sidney D’Mello. 2020.Towards a generalized competency model of collaborative problem solving.Computers and Education, 143.
- Takmaz etal. (2020)Ece Takmaz, Mario Giulianelli, Sandro Pezzelle, Arabella Sinclair, and Raquel Fernández. 2020.Refer, Reuse, Reduce: Generating Subsequent References in Visual and Conversational Contexts.In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4350–4368, Online. Association for Computational Linguistics.
- Tenbrink etal. (2017)Thora Tenbrink, Elena Andonova, Gesa Schole, and KennyR. Coventry. 2017.Communicative success in spatial dialogue: The impact of functional features and dialogue strategies.Language and Speech, 60(2):318–329.PMID: 28697700.
- Tsiros and Palladini (2020)Augoustinos Tsiros and Alessandro Palladini. 2020.Towards a human-centric design framework for ai assisted music production.In New Interfaces for Musical Expression.
- Vats etal. (2024)Vanshika Vats, MarziaBinta Nizam, Minghao Liu, Ziyuan Wang, Richard Ho, MohnishSai Prasad, Vincent Titterton, SaiVenkat Malreddy, Riya Aggarwal, Yanwen Xu, etal. 2024.A survey on human-ai teaming with large pre-trained models.arXiv preprint arXiv:2403.04931.
- von Ahn (2013)Luis von Ahn. 2013.Duolingo: learn a language for free while helping to translate the web.In Proceedings of the 2013 International Conference on Intelligent User Interfaces, IUI ’13, page 1–2, New York, NY, USA. Association for Computing Machinery.
- von Ahn etal. (2006)Luis von Ahn, Mihir Kedia, and Manuel Blum. 2006.Verbosity: a game for collecting common-sense facts.In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’06, page 75–78, New York, NY, USA. Association for Computing Machinery.
- Wang etal. (2023)Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. 2023.Unleashing cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration.arXiv preprint arXiv:2307.05300.
- Wei etal. (2023)Jimmy Wei, Kurt Shuster, Arthur Szlam, Jason Weston, Jack Urbanek, and Mojtaba Komeili. 2023.Multi-party chat: Conversational agents in group settings with humans and models.ArXiv, abs/2304.13835.
- Wiltshire etal. (2018)TravisJ Wiltshire, JonathanE Butner, and StephenM Fiore. 2018.Problem-solving phase transitions during team collaboration.Cognitive science, 42(1):129–167.
- Zarrieß etal. (2016)Sina Zarrieß, Julian Hough, Casey Kennington, Ramesh Manuvinakurike, David DeVault, Raquel Fernández, and David Schlangen. 2016.PentoRef: A corpus of spoken references in task-oriented dialogues.In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 125–131, Portorož, Slovenia. European Language Resources Association (ELRA).
- Zhang etal. (2023)Guanglu Zhang, Leah Chong, Kenneth Kotovsky, and Jonathan Cagan. 2023.Trust in an ai versus a human teammate: The effects of teammate identity and performance on human-ai cooperation.Computers in Human Behavior, 139:107536.
- Zhang et al. (2021) Rui Zhang, Nathan J McNeese, Guo Freeman, and Geoff Musick. 2021. "An ideal human": Expectations of AI teammates in human-AI teaming. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW3):1–25.
Appendix A Tower Defense Designs
Currently implemented tower defense designs that can be adjusted to suit the specified CPS task are as follows.
1. Communication: Voice (bool), push-to-talk (bool), text chat (bool)
2. Description visibility: Tower name (bool), tower description (bool)
3. Number of rounds per level (int)
4. Player resources: Money (shared or individual), health and score (shared)
5. Interactability during attack phase (bool). Enable this to allow adjusting tower placement and upgrading towers during the dynamic attack phase.
6. Towers: We provide 12 custom towers with unique mechanics and effects. Information about towers (name, description) can be customized. The unique towers are: basic, poison (damage over time), piercing (damage multiple enemies in a straight line), splash (area damage), obstacle (spawn an object on the track that does damage when enemies walk over it), slow (slows enemies), fear (enemies go backwards along the track), sniper (does more damage to faster enemies), discount (lowers upgrade costs of nearby towers), support (buffs all stats for nearby towers), multishot (shoots in 4 directions).
7. Levels: A level design designates how enemies spawn, the enemy movement paths, the location of a base that players defend, terrain for where towers can be placed, starting gold and health, and which towers are available to players.
8. Enemies: There are enemy variants that differ in health, movement speed, point value when destroyed, and money given to players when destroyed.
As the platform matures, we expect to implement other common game design paradigms, such as segmenting the map so that players can only place towers on their designated section.
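To make the adjustable parameters above concrete, the following is a minimal sketch of how a study designer might record one task configuration. All class and field names (`TaskConfig`, `CommunicationConfig`, `shared_money`, and so on) are hypothetical and are not taken from the CPS-TaskForge codebase; the rule that push-to-talk requires voice is likewise our assumption. The tower names are those listed in item 6.

```python
from dataclasses import dataclass, field
from typing import List

# Tower names as listed in Appendix A. Every class and field name below is
# hypothetical and does not come from the CPS-TaskForge codebase.
KNOWN_TOWERS = {
    "basic", "poison", "piercing", "splash", "obstacle", "slow",
    "fear", "sniper", "discount", "support", "multishot",
}


@dataclass
class CommunicationConfig:
    voice: bool = False
    push_to_talk: bool = False
    text_chat: bool = True


@dataclass
class TaskConfig:
    communication: CommunicationConfig = field(default_factory=CommunicationConfig)
    show_tower_name: bool = True
    show_tower_description: bool = True
    rounds_per_level: int = 2
    shared_money: bool = True  # False gives each player an individual budget
    interact_during_attack: bool = False
    available_towers: List[str] = field(default_factory=lambda: ["basic", "slow"])

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        if self.rounds_per_level < 1:
            problems.append("rounds_per_level must be at least 1")
        # Assumption: push-to-talk is only meaningful when voice is enabled.
        if self.communication.push_to_talk and not self.communication.voice:
            problems.append("push_to_talk requires voice")
        unknown = sorted(set(self.available_towers) - KNOWN_TOWERS)
        if unknown:
            problems.append(f"unknown towers: {unknown}")
        return problems
```

Calling `TaskConfig().validate()` on the defaults returns an empty list; a configuration that enables push-to-talk without voice, or names a tower outside the list above, returns one message per problem, so a designer can check a condition before running a session.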
Appendix B Case study results
Table 4 describes our case study in the context of other tasks with open data.
Task | Teams | Participants | Team Size | Tokens | Size | Repetitions | Round Dur. | Study Dur. | Recruitment Platform |
TEAMS | 63 | 252 | 3–4 | 573k | 110k utterances | 2 | 30 min | 1.5 hr | Local |
ASIST | 64 | 192 | 3 | — | — | 2 | 15 min | 3.5 hr | Online, Local |
CerealBar | N/A | 264 | 2 | 325k | 24k utterances | N/A | 16.5 min | — | Crowdworker |
PhotoBook | N/A | 1,514 | 2 | 984k | 164.6k utterances | N/A | — | 14.2 min | Crowdworker |
HCRC Map Task | 32 | 64 | 2 | 150k | 18 hr | 4 | — | — | School |
PentoRef | 63 | 127 | 2 | 216.3k | 23k utterances | — | — | — | — |
KTH Tangrams | 42 | 84 | 2 | 68k | 11 hr / 15k utterances | — | — | 15 min | Local |
Cards | N/A | — | 2 | 282k | 45,805 utterances | N/A | 8.5 min | — | Crowdworker |
CPS-TaskForge Pilot | 8 | 35 | 3–4 | 8k | 4k utterances | 9 | 4–6 min | 1.5 hr | Local |
Sample conversations are in Table 5.
— Level 1 Round 1 — |
Mundert: no slow :( |
Mundert: spam damage? |
oobma: sure |
Mundert: oh wait |
oobma: we got different towers |
Mundert: we have different towers |
TommyVCT: I guess just yolo it |
omar: yeah |
Mundert: ok mine only do damage |
TommyVCT: I have the one that makes enemies sluggish |
TommyVCT: looks like we got a lot of money |
omar: mine only do damage too |
TommyVCT: oops nevermind we are broke lol |
Mundert: easy win |
oobma: gogo? |
omar: lets go |
TommyVCT: gogogo |
TommyVCT: it’s funny that they went backwards |
Mundert: oh it looks like we can kill box with the tree that frightens enimies |
Mundert: and the vine one |
omar: we probably went overboard lol |
Mundert: and area damage would be good with that too |
TommyVCT: ez |
omar: probably should save money next time to get higher score |
— Level 1 Round 2 — |
Mundert: wait if we lose do we still get a score |
omar: its the same enemies right? |
TommyVCT: looks like it’s the same |
omar: lets have the same setup at the start and nothing after |
omar: to save money |
Mundert: ok christmas tree and vine killbox? |
TommyVCT: I got the same roll of the tools too |
Mundert: whatever the cannon was for area damage? |
Mundert: spam em |
omar: who has the cannons? |
oobma: was it the cannon? i only had 1 i thought |
oobma: pretty sure it was the plant thing |
omar: sorry the catapult |
omar: its missing here |
Mundert: cannon does area damage |
TommyVCT: I’ll try to deter the enemies using the diamond |
Mundert: so we should use that for a killbox |
Mundert: single target is kinda bad for a killbox |
Mundert: so im not placing my catapults if we do that |
oobma: how many cannons then |
oobma: 4 more? |
omar: maybe 2? |
Mundert: sure |
Mundert: hoewver we can afford and more trees and vines too right |
TommyVCT: wait |
TommyVCT: should I sell my diamonds? |
Mundert: maybe those crossbow things in the line as well |
Mundert: not all |
Mundert: right |
Mundert: because slow is also good |
omar: sell the diamonds in tile (8,9) and (8,8) |
oobma: imo the cross bows would be good at 8,9 |
oobma: and 8,8 |
omar: ill putt a cross bw there |
Mundert: agree |
TommyVCT: That’s all I got |
Mundert: > |
Mundert: ? |
TommyVCT: The tank or controller like thingy is for faster emenies |
Mundert: wait why is the tank there |
omar: but could you sell tile 8,9? |
TommyVCT: oh I put there |
omar: crossbow is better there |
Mundert: agree |
Mundert: aight |
Mundert: nice |
omar: much better |
Mundert: i dont think we need the tank |
TommyVCT: yeah it’s kinda useless |
Mundert: more tree and vine and other such area of affect towers |
<speaker>tjwill</speaker> <chat_text>Full map ones we probably want bottom left </chat_text> |
<action>BUY</action> <tower_type>DISCOUNT</tower_type> <location>(10, 0)</location> <user>ManedWlf</user> |
<speaker>tjwill</speaker> <chat_text>If you do a 3x3 grid, empty the center and I’ll put an upgrade gem. </chat_text> |
<action>BUY</action> <tower_type>MULTI</tower_type> <location>(13, 5)</location> <user>schou01</user> |
<action>BUY</action> <tower_type>MAP</tower_type> <location>(0, 14)</location> <user>ManedWlf</user> |
<action>BUY</action> <tower_type>MAP</tower_type> <location>(0, 15)</location> <user>ManedWlf</user> |
<action>BUY</action> <tower_type>MAP</tower_type> <location>(0, 13)</location> <user>ManedWlf</user> |
<action>BUY</action> <tower_type>MAP</tower_type> <location>(1, 13)</location> <user>ManedWlf</user> |
<speaker>tjwill</speaker> <chat_text>Then we want a discount tower on the outside, upgrades are Sponsive! </chat_text> |
<action>BUY</action> <tower_type>MAP</tower_type> <location>(2, 13)</location> <user>ManedWlf</user> |
<action>BUY</action> <tower_type>SUPPORT</tower_type> <location>(1, 14)</location> <user>tjwill</user> |
<action>BUY</action> <tower_type>MAP</tower_type> <location>(1, 15)</location> <user>ManedWlf</user> |
<action>BUY</action> <tower_type>MAP</tower_type> <location>(2, 15)</location> <user>ManedWlf</user> |
<action>BUY</action> <tower_type>MAP</tower_type> <location>(2, 14)</location> <user>ManedWlf</user> |
<action>BUY</action> <tower_type>MAP</tower_type> <location>(1, 12)</location> <user>ManedWlf</user> |
<speaker>schou01</speaker> <chat_text>where do we want to focus our offense? </chat_text> |
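The tagged sample above interleaves chat messages with game actions, so a small parser is enough to turn it into structured events. The sketch below is ours and assumes only the tag names visible in the sample (`speaker`, `chat_text`, `action`, `tower_type`, `location`, `user`); the function name `parse_line` and the added `kind` field are hypothetical, and the released data format may differ.

```python
import re

# Matches <tag>value</tag> pairs; the backreference \1 requires the closing
# tag to match the opening tag.
TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>")
# Matches a grid location such as "(10, 0)".
LOC_RE = re.compile(r"\((\d+),\s*(\d+)\)")


def parse_line(line):
    """Parse one tagged log line into a dict of its fields.

    Adds a hypothetical "kind" field ("chat" or "action") and converts the
    location string into a tuple of ints when one is present.
    """
    event = {tag: value.strip() for tag, value in TAG_RE.findall(line)}
    if "location" in event:
        match = LOC_RE.fullmatch(event["location"])
        if match:
            event["location"] = (int(match.group(1)), int(match.group(2)))
    event["kind"] = "chat" if "chat_text" in event else "action"
    return event


# Two lines copied from the sample above.
SAMPLE = [
    "<speaker>tjwill</speaker> <chat_text>Full map ones we probably want bottom left </chat_text> |",
    "<action>BUY</action> <tower_type>DISCOUNT</tower_type> <location>(10, 0)</location> <user>ManedWlf</user> |",
]

events = [parse_line(line) for line in SAMPLE]
```

With events in this form, chat and action streams can be aligned by their order in the log, for example to check whether a suggestion in chat is followed by a matching tower purchase.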
Appendix C Survey questions
The pre-survey collected basic demographic information.
The post-survey contained the Teamwork Quality (TWQ) questionnaire (Hoegl and Gemuenden, 2001), the VIA Team Roles inventory (Ruch et al., 2018), and an open-ended task-specific questionnaire. Both TWQ and VIA used a 7-point Likert scale with options: Strongly Disagree, Disagree, Somewhat Disagree, Neutral, Somewhat Agree, Agree, and Strongly Agree.
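For readers who wish to aggregate responses on this scale, one common approach is to map the seven options to the integers 1 to 7 and average within each subscale. The sketch below illustrates that approach under our own assumptions: the numeric coding, the function name `subscale_mean`, and the optional reverse-scoring of negatively worded items are not specified in the survey description above.

```python
from statistics import mean

# The seven response options, in the order given in the text.
LIKERT = [
    "Strongly Disagree", "Disagree", "Somewhat Disagree", "Neutral",
    "Somewhat Agree", "Agree", "Strongly Agree",
]
# Assumption: options are coded 1 (Strongly Disagree) to 7 (Strongly Agree).
SCORE = {label: value for value, label in enumerate(LIKERT, start=1)}


def subscale_mean(responses, reverse_items=frozenset()):
    """Average the numeric scores of one participant's subscale responses.

    `responses` is a list of option labels, one per item. `reverse_items`
    holds the 0-based indices of items to reverse-score (8 - score); whether
    any item should be reversed is an assumption left to the analyst.
    """
    scores = []
    for index, label in enumerate(responses):
        score = SCORE[label]
        if index in reverse_items:
            score = 8 - score
        scores.append(score)
    return mean(scores)
```

For example, `subscale_mean(["Agree", "Strongly Agree"])` averages the scores 6 and 7, while passing `reverse_items={1}` flips the second item to 1 before averaging.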
C.1 TWQ
- •
Communication
- –
There was frequent communication within the team
- –
The team members communicated mostly directly and personally with each other.
- –
There were mediators through whom much communication was conducted.
- –
Project-relevant information was shared openly by all team members
- –
Important information was kept away from other team members in certain situations.
- –
In our team there were conflicts regarding the openness of the information flow.
- –
The team members were happy with the timeliness in which they received information from other team members
- –
The team members were happy with the precision of the information received from other team members
- –
The team members were happy with the usefulness of the information received from other team members
- –
- •
Coordination
- –
The work done on subtasks within the project was closely harmonized.
- –
There were clear and fully comprehended goals for subtasks within our team.
- –
The goals for subtasks were accepted by all team members.
- –
There were conflicting interests in our team regarding subtasks/subgoals.
- –
- •
Mutual Support
- –
The team members helped and supported each other as best they could.
- –
If conflicts came up, they were easily and quickly resolved
- –
Discussions and controversies were conducted constructively.
- –
Suggestions and contributions of team members were respected
- –
Suggestions and contributions of team members were discussed and further developed.
- –
Our team was able to reach consensus regarding important issues.
- •
Effectiveness
- –
Going by the results, this project can be regarded as successful.
- –
The team was satisfied with the project result.
Open-response questions:
- •
What went well during the game?
- •
What went poorly during the game?
- •
Any notable communication difficulties or frustrations? If they were resolved, how did you resolve them?
- •
Any notable joyous or satisfactory communications?
- •
Suppose you played the game again with different maps but the same set of players. What would you change?
- •
(Optional) Any other comments or complaints about your teamwork or communication?
C.2 VIA Team roles
Instructions for participants: for every role, read the description and answer the questions, imagining that you are currently in your ideal team.
- •
Idea Creator. When working in a team, the creation of new ideas to come up with a solution for a difficult problem or task is essential. Thereby, Idea Creators are people with unconventional ways of coming to solutions and great ideas.
- –
In my ideal team, I’m at my best when coming up with ideas.
- –
I enjoy creating ideas within my ideal team
- –
I am able to be a great idea creator within my ideal team
- –
I have a feeling of energized focus when coming up with ideas within my ideal team
- –
It makes me feel good to create ideas in my ideal team
- •
Information Gatherer. Information Gatherers search for information, for example on topics as best practices, new trends, potential vendors, competition, and so forth.
- –
In my ideal team, I’m at my best when gathering information
- –
I enjoy gathering information within my ideal team
- –
I am able to be a great information gatherer within my ideal team
- –
I have a feeling of energized focus when gathering information within my ideal team
- –
It makes me feel good to gather information within my ideal team
- •
Decision Maker. Decision Makers are processing all the information at hand, integrating it to make the best possible decision and clarifying the goals.
- –
In my ideal team, I’m at my best when making decisions
- –
I enjoy making decisions within my ideal team
- –
I am able to be a great decision maker within my ideal team
- –
I have a feeling of energized focus when making decisions within my ideal team
- –
It makes me feel good to make decisions within my ideal team
- •
Implementer. Once a team has arrived at a decision on its direction, it needs to implement it. Thereby the Implementer constantly controls the current status and takes measures to work towards the goal.
- –
In my ideal team, I’m at my best when implementing goals
- –
I enjoy implementing goals within my ideal team
- –
I am able to be a great implementer in my ideal team
- –
I have a feeling of energized focus when implementing goals in my ideal team
- –
It makes me feel good to implement goals in my ideal team
- •
Influencer. Commonly, the work product of the team needs to be presented by the Influencer for acceptance internally (supervisors, administrators) and/or externally (customers). This is a process of influencing and being persuasive.
- –
I’m at my best when representing the work/opinion of the team and convincing others of it
- –
As a member of my ideal team, I enjoy representing the work/opinion of the team and convincing others of it
- –
I am able to be a great influencer in my ideal team
- –
I have a feeling of energized focus when representing the work/opinion of my ideal team and when convincing others of it
- –
It makes me feel good to represent the work/opinion of my ideal team and convince others of it
- •
Energizer. In the process of getting work done, Energizers are people that infuse energy into the work and others. Teams without enough energy can fall flat and struggle during times of pressure or prolonged projects that require endurance.
- –
In my ideal team, I’m at my best when energizing
- –
I enjoy energizing within my ideal team
- –
I am able to be a great energizer within my ideal team
- –
When I focus on infusing energy into work and others of my ideal team, I feel energized too
- –
It makes me feel good to energize within my ideal team
- •
Relationship Manager. Since the working of a team is a dynamic interplay of people and their relationships, the Relationship Manager helps to run relationships smoothly and to resolve conflicts.
- –
In my ideal team, I’m at my best when managing relationships
- –
I enjoy managing relationships within my ideal team
- –
I am able to be a great relationship manager within my ideal team
- –
I have a feeling of energized focus when I manage relationships within my ideal team
- –
It makes me feel good to manage relationships within my ideal team
Appendix D CPS classification
The CPS skill taxonomy used for classifying utterances in the CPS pilot, reproduced from Andrews et al. (2019):
- 1.
Sharing information. Content relevant information communicated during collaboration and includes sharing one’s own information, sharing task or resource information, and sharing understanding
- 2.
Maintaining communication. Content irrelevant social communication and includes general off-topic communication, rapport-building communication, and inappropriate communication
- 3.
Establishing shared understanding. Communication in the service of attempting to learn the perspective of others and trying to establish that what has been said is understood.
- 4.
Negotiating. Communication used to express agreement or disagreement and to attempt to resolve conflicts when they arise
- 5.
Exploring and understanding. Actions in the task environment to explore and understand the problem space.
- 6.
Representing and formulating. Actions and communication used to build a coherent mental representation of the problem and formulate hypotheses
- 7.
Planning. Communication used to develop a strategy or plan to solve the problem
- 8.
Executing actions. Actions and communication used in the service of carrying out a plan (e.g., enacting a strategy or communicating to teammates actions one is taking to carry out the plan).
- 9.
Monitoring. Actions and communication used to monitor progress toward the goal and monitor the team’s organization
D.1 Annotation challenges
Annotating the data for CPS skills using the taxonomy developed by Andrews et al. (2019) was challenging because the labels are not clearly distinct from one another, which contributed to the relatively low inter-annotator agreement.
For example, consider the following snippet:
(1) ManedWlf: I have a basic tower with a range of 22, fire rate of 0.8
(2) ManedWlf: Shall I place a couple close to the castle?
(3) tjwill: Looks like we’ve got the same ones to start with, and sounds good!
When ManedWlf describes the basic tower in (1), we can label the utterance for sharing information because it is sharing resource information. In (2), a plan is proposed to place some basic towers near the castle, which we can label for planning. In (3), we have an observation about both players having the same basic tower. This could be labeled for sharing information because tjwill is sharing information about having access to the same basic tower. It could also be labeled representing and formulating because tjwill is building a mental representation about how everyone has the same starting towers.
We defined a few soft rules for classification to help with annotation consistency, but we suggest that future work investigate designing a more complex taxonomy with clearer distinctions between labels.
Soft rules used when manually classifying CPS skills:
- •
If a player asks for opinions about placing towers or making upgrades, classify it as Planning.
- •
If players agree to a plan, classify as Negotiating even if it’s just “ok” because it is expressing agreement about a plan proposal.
- •
If a plan is proposed and another player proposes an alternative or disagrees, classify as Negotiating.
- •
Representing and formulating is about understanding the efficacy of towers or strategy enacted, e.g., “the blue tower seems to slow enemies down”
- •
If a player asks someone else to do something, classify as Planning because it is working towards developing the strategy.
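The text above reports relatively low inter-annotator agreement without naming the statistic used. As one common option, the following sketch computes Cohen's kappa for two annotators who each assign a single label per utterance. The choice of kappa, the single-label simplification, and the function name `cohen_kappa` are assumptions made for illustration, not a description of the procedure used in this work.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators, one label per utterance (assumed setup)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of utterances given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Because utterances in the pilot could receive more than one skill label, a multi-label agreement measure may be more appropriate in practice; this sketch only covers the single-label case.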
D.2 Prompt
We tried automatic annotation with GPT-4, but agreement with both authors was only 55%, and developing a CPS classification model with higher accuracy is beyond the scope of this work. We list the prompt prefix used for documentation purposes. We used the prompt prefix to classify batches of 6 utterances.
CPS skills list:
<skill>Sharing information</skill>. content relevant information communicated during collaboration and includes sharing one’s own information, sharing task or resource information, and sharing understanding
<skill>Maintaining communication</skill>. content irrelevant social communication and includes general off-topic communication, rapport-building communication, and inappropriate communication
<skill>Establishing shared understanding</skill>. communication in the service of attempting to learn the perspective of others and trying to establish that what has been said is understood.
<skill>Negotiating</skill>. communication used to express agreement or disagreement and to attempt to resolve conflicts when they arise
<skill>Representing and formulating</skill>. actions and communication used to build a coherent mental representation of the problem and formulate hypotheses
<skill>Planning</skill>. communication used to develop a strategy or plan to solve the problem
<skill>Executing actions</skill>. actions and communication used in the service of carrying out a plan (e.g., enacting a strategy or communicating to teammates actions one is taking to carry out the plan).
<skill>Monitoring</skill>. actions and communication used to monitor progress toward the goal and monitor the team’s organization
You are given a numbered list of inputs. For each input:
Step 1: classify the <chat_text> for one or more <skills> displayed
Step 2: Explain your reasoning in <reason> tags.
Inputs
1. <speaker>ym2552</speaker> <chat_text>It’s just when they come in big groups that’s worrying, as it seems most towers can only focus on </chat_text>
2. <speaker>schou1</speaker> <chat_text>any chance we can get a buff or discount tower at 9,4?</chat_text>
3. <speaker>jane</speaker> <chat_text>willdo</chat_text>
4. <speaker>paul</speaker> <chat_text>hell, even 1 more turret near the bottom probably would’ve gotten them all, but we’re doing good</chat_text>
Outputs
1. <skill>Representing and formulating</skill>
<reason>The speaker is explaining that when a lot of enemies come at once, they worry the towers will be overwhelmed.</reason>
2. <skill>Planning</skill>
<reason>The speaker is asking another player to place a buff or discount tower at a specific location to further develop the solution</reason>
3. <skill>Executing actions</skill>
<reason>the player is acknowledging a request to act, showing they will execute an action</reason>
4. <skill>Representing and formulating</skill><skill>Maintaining communication</skill>
<reason>the player hypothesizes having one more turret near the bottom would have helped the strategy, then comments the team is doing well to build rapport.</reason>
---
Inputs
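To illustrate how the prompt prefix could be used in batches of 6, the sketch below formats utterances into the numbered input layout shown in the prefix and reads the `<skill>` tags back out of a response in the layout of the few-shot outputs. The helper names `format_inputs` and `parse_outputs` are hypothetical, the model call itself is omitted, and the parser assumes the response follows the few-shot format exactly.

```python
import re

BATCH_SIZE = 6  # batch size reported in the text

ITEM_PATTERN = re.compile(r"^(\d+)\.\s*(.*)$")
SKILL_PATTERN = re.compile(r"<skill>(.*?)</skill>")


def format_inputs(utterances, batch_size=BATCH_SIZE):
    """Yield numbered input blocks; utterances are (speaker, text) pairs."""
    for start in range(0, len(utterances), batch_size):
        batch = utterances[start:start + batch_size]
        lines = [
            f"{i}. <speaker>{speaker}</speaker> <chat_text>{text}</chat_text>"
            for i, (speaker, text) in enumerate(batch, start=1)
        ]
        yield "\n".join(lines)


def parse_outputs(response):
    """Map each numbered output item to its list of skill labels."""
    skills = {}
    for line in response.splitlines():
        match = ITEM_PATTERN.match(line.strip())
        if match:
            skills[int(match.group(1))] = SKILL_PATTERN.findall(match.group(2))
    return skills
```

Each block produced by `format_inputs` would be appended after the final `Inputs` line of the prefix before being sent to the model.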
Appendix E Potential CPS-TaskForge Tasks
We decided to use the tower defense game genre as the task for CPS-TaskForge after considering several other games.
- 1.
Pandemic™ board game. We found valuable play-by-forum games that demonstrated the type of multi-turn collaborative communication we hope to see in CPS data. However, one instance of the game takes at least 30 minutes to complete, making it challenging to evaluate intermediate task process. The lengthy duration is also a barrier to task repetition within a single study session.
- 2.
Cryptic Crossword puzzles. The cryptic crossword puzzle variant relies on metahints and wordplay, making it more accessible than regular crosswords that require trivia knowledge. However, learning the rules is difficult: participants required 2–3 hours to understand the rules in pilot tests. Communication during the task also often consisted of short utterances suggesting the solution, with reasoning provided only if teammates requested it.
Appendix F License
The Godot game engine has an MIT license. The terms for use of our artifacts will be included in our released package.