As Internet-of-Things (IoT) devices become pervasive in our everyday lives, many opportunities arise to provide added-value services and convenience to society. To achieve this, devices must collaborate to obtain a better understanding of their surroundings and act seamlessly as if they were all part of a single system. For instance, a gym surveillance camera could provide your fitness tracker with enhanced running analytics, and the treadmill could automatically stream audio updates to your headphones as you run. However, for that to happen, the devices would need to identify you and the devices you carry, so that you do not receive somebody else's updates. Numerous potential applications, such as passive device pairing, autonomous
retail stores or automatic access control, rely on object
identification. Existing methods, such as user-input (e.g. QR code) or data-driven (e.g. fingerprint) approaches, can be used to assist this identification process through
context-aware sensing, but they often require active user intent or pre-calibration, both of which become infeasible for the number of devices a typical user is expected to own. Moreover, the wide range of sensing modalities present in the IoT ecosystem (vision, acceleration, vibration, weight, etc.) further hinders the accurate identification of objects. Therefore, with the increase in device density and the growing diversity of their sensing and computational capabilities, it becomes increasingly important to automate this collaborative multi-modal information aggregation process.

This thesis explores the collaborative multi-modal object identification problem. It presents a generic framework to tackle this problem and introduces a methodology in which physical knowledge is utilized across different stages of the information acquisition process. Building on the common observation and analysis of objects' motion, shared feature extraction, collaborative multi-modal feature fusion, and data-quality improvement feedback loops are presented as ways to enhance object identification performance. As a result, this approach significantly reduces the amount of training data required while achieving high identification accuracy, even in fully passive and unconstrained environments.

The methodology is demonstrated with a focus on collaborative identification based on vision and kinematic sensing, as these are among the most prevalent sensing categories in the IoT domain. To pair wearable devices, our motion matching approach achieves up to 92.2% identification accuracy even when a user wears 13 different devices simultaneously on different parts of their body. For autonomous retail stores, we correctly predicted which items customers took 93.2% of the time over nearly 400 shopping events. For drone identification in a swarm where all drones look identical, we achieve a 3x improvement in identification time (7s for 24 drones) and a 9x improvement in survival rate (92% of drones did not crash).
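For concreteness, the sketch below illustrates one simple way such camera-to-wearable motion matching could be realized: correlating the acceleration magnitude reported by a wearable IMU with the acceleration derived from each person track in the camera view, and pairing the IMU with the best-matching track. The function names, the use of plain normalized correlation, and the assumption of time-aligned, equally sampled signals are illustrative choices, not details taken from the thesis.

```python
# Illustrative sketch (not the thesis implementation): pair a wearable IMU with
# the camera track whose motion correlates best with the IMU's acceleration.
import numpy as np

def accel_magnitude_from_track(positions: np.ndarray, fps: float) -> np.ndarray:
    """Approximate acceleration magnitude from a (T, 2) array of tracked positions."""
    velocity = np.gradient(positions, 1.0 / fps, axis=0)
    accel = np.gradient(velocity, 1.0 / fps, axis=0)
    return np.linalg.norm(accel, axis=1)

def normalized_correlation(a: np.ndarray, b: np.ndarray) -> float:
    """Zero-mean, unit-variance correlation between two equally sampled signals."""
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)
    n = min(len(a), len(b))
    return float(np.dot(a[:n], b[:n]) / n)

def match_imu_to_tracks(imu_accel_mag: np.ndarray,
                        camera_tracks: dict[str, np.ndarray],
                        fps: float) -> str:
    """Return the ID of the camera track whose motion best matches the IMU signal."""
    scores = {
        track_id: normalized_correlation(
            imu_accel_mag, accel_magnitude_from_track(positions, fps))
        for track_id, positions in camera_tracks.items()
    }
    return max(scores, key=scores.get)
```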