NEW RESEARCH
PolySem: Efficient Polyglot Analytics on Semantic Data
Data scientists and data engineers spend a large portion of their time trying to understand, clean and transform their data before they can even start performing meaningful analysis. Most database vendors provide business intelligence (BI) tools as an efficient and user-friendly platform for customers to perform data cleaning, preparation and linking tasks to obtain actionable semantic data. However, customers are increasingly interested in querying semantic data through various modalities including SQL, imperative programming languages such as Python, and natural language queries. Today, customers are limited to using either the visual interfaces provided by these tools or languages that are specific to the particular tool.
In a new paper: PolySem: Efficient Polyglot Analytics on Semantic Data, researchers from Microsoft propose techniques to enable the execution of user queries expressed in different modalities on semantic datasets without having to export data out of the BI system. Their techniques include automatic translation of user queries into a language-agnostic representation of data processing operations, and subsequently into the specific query language that is amenable to execution on the BI engine. Evaluation results on BI and decision support benchmarks suggest significant improvements in query performance compared to other popular data processing engines.
Spotlight: blog post
NEW RESOURCE
Generative retrieval for conversational question answering
The growth of conversational agents, including voice assistants and chatbots, has led to a shift towards dialogue-based interfaces for information-seeking activities. This has spurred the development of conversational question answering (QA) systems. Effective passage retrieval, which excludes irrelevant data from scanned documents, is crucial but challenging for such systems due to the ambiguity of questions. Current methods rely on the dual-encoder architecture to embed contextualized vectors of questions in conversations. However, this architecture is limited in the embedding bottleneck and the dot-product operation.
To alleviate these limitations, researchers from Microsoft propose generative retrieval for conversational QA (GCoQA). GCoQA assigns distinctive identifiers for passages and retrieves passages by generating their identifiers token-by-token via the encoder–decoder architecture. In this generative way, GCoQA eliminates the need for a vector-style index and could attend to crucial tokens of the conversation context at every decoding step. Experiments on three public datasets containing about twenty million passages show GCoQA achieves relative improvements of +13.6% in passage retrieval and +42.9% in document retrieval. GCoQA also reduces memory usage and improves inference speed.
NEW RESOURCE
BatteryML: An open-source tool for machine learning on battery degradation
In recent years, lithium-ion batteries have become the cornerstone of energy storage solutions, owing to their high energy density, long cycle life, and relatively low self-discharge. They have found widespread applications across various industries, including electric vehicles, consumer electronics, and renewable energy systems. Despite these advantages, lithium-ion batteries face challenges related to capacity degradation and performance optimization, which have become critical areas of focus in battery research.
Capacity degradation is a complex process influenced by various factors such as temperature, charge-discharge rate, and state of charge. Understanding and mitigating these factors is crucial for enhancing the performance and longevity of lithium-ion batteries. This has led to the development of advanced battery management systems and the application of machine learning techniques to improve prediction accuracy and optimize battery performance.
To address these challenges, researchers from Microsoft have released BatteryML (opens in new tab), a comprehensive open-source tool designed specifically for machine learning researchers, battery scientists, and materials researchers with an interest in battery performance prediction and analysis. BatteryML aims to address the challenges of capacity degradation by leveraging machine learning methods to improve various aspects of battery performance, such as capacity fade modeling, state of health prediction, and state of charge estimation.