On the effective transfer of knowledge from English to Hindi Wikipedia
Das P., Roy A., Chakraborty R., Mukherjee A.
Conference paper, Proceedings - International Conference on Computational Linguistics, COLING, 2025
Although Wikipedia is the largest multilingual encyclopedia, it remains inherently incomplete. There is a significant disparity in the quality of content between high-resource languages (HRLs, e.g., English) and low-resource languages (LRLs, e.g., Hindi), with many LRL articles lacking adequate information. To bridge these content gaps, we propose a lightweight framework to enhance knowledge equity between English and Hindi. If the English Wikipedia page is not up to date, our framework extracts relevant information from readily available external resources (such as English books) and adapts it to align with Wikipedia’s distinctive style, including its neutral point of view (NPOV) policy, using the in-context learning capabilities of large language models. The adapted content is then machine-translated into Hindi for integration into the corresponding Wikipedia articles. If, on the other hand, the English version is comprehensive and up to date, the framework directly transfers knowledge from English to Hindi. Our framework effectively generates new content for Hindi Wikipedia sections, enhancing Hindi Wikipedia articles by 65% and 62% according to automatic and human judgment-based evaluations, respectively.
REVERSUM: A Multi-staged Retrieval-Augmented Generation Method to Enhance Wikipedia Tail Biographies through Personal Narratives
Adak S., Meher P.M., Das P., Mukherjee A.
Conference paper, Proceedings - International Conference on Computational Linguistics, COLING, 2025
Wikipedia is an invaluable resource for factual information about a wide range of entities. However, the quality of articles on lesser-known entities often lags behind that of articles on well-known ones. This study proposes a novel approach to enhancing Wikipedia’s B and C category biography articles by leveraging personal narratives such as autobiographies and biographies. By utilizing a multi-staged retrieval-augmented generation technique, REVERSUM, we aim to enrich the informational content of these lesser-known articles. Our study reveals that personal narratives can significantly improve the quality of Wikipedia articles, providing a rich source of reliable information that has been underutilized in previous studies. Based on crowd-based evaluation, REVERSUM-generated content outperforms the best-performing baseline by 17% in terms of integrability into the original Wikipedia article and by 28.5% in terms of informativeness.
Diversity matters: Robustness of bias measurements in Wikidata
Das P., Karnam S.K., Panda A., Guda B.P.R., Sarkar S., Mukherjee A.
Conference paper, ACM International Conference Proceeding Series, 2023
With the widespread use of knowledge graphs (KGs) in various automated AI systems and applications, it is very important to ensure that information retrieval algorithms leveraging them are free from societal biases. Previous works have depicted biases that persist in KGs and have employed several metrics for measuring those biases. However, such studies lack a systematic exploration of how sensitive the bias measurements are to the source of the data or to the embedding algorithm used. To address this research gap, in this work, we present a holistic analysis of bias measurement on the knowledge graph. First, we attempt to reveal data biases that surface in Wikidata for thirteen different demographics selected from seven continents. Next, we attempt to unfold the variance in the detection of biases by two different knowledge graph embedding algorithms, TransE and ComplEx. We conduct our extensive experiments on a large number of occupations sampled from the thirteen demographics with respect to the sensitive attribute, i.e., gender. Our results show that the inherent data bias that persists in a KG can be altered by the specific algorithm bias introduced by KG embedding learning algorithms. Further, we show that the choice of the state-of-the-art KG embedding algorithm has a strong impact on the ranking of biased occupations irrespective of gender. In particular, we find that the embedding algorithm ComplEx is more robust to the choice of demographics than TransE. Subsequently, we observe that the similarity of the biased occupations across demographics is minimal, which reflects the socio-cultural differences around the globe. This is often overlooked by most of the coarse-grained approaches working at the aggregate level.
We believe that this full-scale audit of the bias measurement pipeline will raise awareness among the community, yield insights into design choices for both data and algorithms, and steer the community away from the popular dogma of "one-size-fits-all".
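For readers unfamiliar with the two embedding algorithms named in the abstract above, the following minimal sketch shows the standard triple-scoring functions of TransE and ComplEx. It is not the authors' bias measurement pipeline, which the abstract does not specify; the function names and the toy embedding values are assumptions made up for illustration.

```python
import math

def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance ||h + r - t||.

    Higher (closer to zero) means the triple is more plausible.
    """
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def complex_score(head, relation, tail):
    """ComplEx plausibility: Re(sum_i h_i * r_i * conj(t_i)).

    Embeddings are lists of Python complex numbers; higher is more plausible.
    """
    return sum(h * r * t.conjugate() for h, r, t in zip(head, relation, tail)).real

# Toy embeddings, invented for illustration (not learned from Wikidata).
person = [1.0, 0.0]
has_occupation = [0.0, 1.0]
occupation_a = [1.0, 1.0]   # exactly person + has_occupation
occupation_b = [3.0, -2.0]  # far from person + has_occupation

print(transe_score(person, has_occupation, occupation_a))
print(transe_score(person, has_occupation, occupation_b))
print(complex_score([1 + 1j], [1j], [-1 + 1j]))
```

Because the two models score the same triple with different geometry (translation versus a complex-valued bilinear product), it is plausible that rankings derived from them differ, which is the kind of sensitivity the abstract reports.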
Mining the online infosphere: A survey
Adak S., Chakraborty S., Das P., Das M., Dash A., Hazra R., Mathew B., Saha P., Sarkar S., Mukherjee A.
Review, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2022
Artificial Intelligence (AI)-based systems and applications have evolved to pervade everyday life, making decisions that have a momentous impact on individuals and society. With the staggering growth of online data, often termed the online infosphere, it has become paramount to monitor the infosphere to ensure social good, as AI-based decisions depend heavily on it. This survey aims to provide a comprehensive review of some of the most important research areas related to the infosphere, focusing on the technical challenges and potential solutions. The survey also outlines some important future directions. We begin by focusing on the collaborative systems that have emerged within the infosphere, with a special thrust on Wikipedia. Next, we demonstrate how the infosphere has been instrumental in the growth of scientific citations and collaborations, thus fuelling interdisciplinary research. Finally, we illustrate the issues related to the governance of the infosphere, such as tackling (a) rising hateful and abusive behavior and (b) bias and discrimination in different online platforms and news reporting. This article is categorized under: Application Areas > Internet.
Quality Change: Norm or Exception? Measurement, Analysis and Detection of Quality Change in Wikipedia
Das P., Guda B.P.R., Seelaboyina S.B., Sarkar S., Mukherjee A.
Article, Proceedings of the ACM on Human-Computer Interaction, 2022
Wikipedia has become an immensely popular crowd-sourced encyclopedia for disseminating information on a wide variety of topics in the form of subscription-free content. It allows anyone to contribute so that the articles remain comprehensive and up to date. To enrich content without compromising standards, the Wikipedia community enumerates a detailed set of guidelines that should be followed. Based on these, Wikipedia editors categorize articles into several quality classes reflecting increasing adherence to the guidelines. This quality assessment task is laborious for editors and demands platform expertise. As a first objective, in this paper, we study the evolution of a Wikipedia article with respect to such quality scales. Our results show novel, non-intuitive patterns emerging from this exploration. As a second objective, we attempt to develop an automated data-driven approach for detecting the early signals that influence the quality change of articles. We pose this as a change point detection problem in which we represent an article as a time series of consecutive revisions and encode every revision by a set of intuitive features. Finally, various change point detection algorithms are used to efficiently and accurately detect the future change points. We also perform various ablation studies to understand which groups of features are most effective in identifying the change points. To the best of our knowledge, this is the first work that rigorously explores the quality life cycle of English Wikipedia articles from the perspective of quality indicators and provides a novel unsupervised page-level approach to detecting quality switches, which can help in automatic content monitoring in Wikipedia, thus contributing significantly to the CSCW community.
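The change point formulation in the abstract above can be illustrated with a minimal sketch. It assumes a single hypothetical feature per revision (article size in kilobytes) and a simple least-squares mean-shift detector. The abstract does not name the algorithms or features the authors used, so the function names, the feature, and the sample values here are illustrative assumptions only.

```python
from statistics import fmean

def sse(values):
    """Sum of squared deviations of a segment from its own mean."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)

def best_change_point(series, min_size=2):
    """Return the index k that best splits series into series[:k] and series[k:].

    "Best" means the split minimizing the total within-segment squared error,
    i.e. a single mean-shift change point. Each segment keeps at least
    min_size observations.
    """
    if len(series) < 2 * min_size:
        raise ValueError("series too short for the requested min_size")
    candidates = range(min_size, len(series) - min_size + 1)
    return min(candidates, key=lambda k: sse(series[:k]) + sse(series[k:]))

# Hypothetical revision history: article size (KB) after each of ten revisions.
revision_sizes = [4.0, 4.2, 4.1, 4.3, 4.2, 9.8, 10.1, 10.0, 10.3, 9.9]
print(best_change_point(revision_sizes))  # index of the first revision after the shift
```

A real pipeline would encode each revision with several features and would likely detect multiple change points; this sketch only shows the shape of the problem.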
When Expertise Gone Missing: Uncovering the Loss of Prolific Contributors in Wikipedia
Das P., Guda B.P.R., Chakraborty D., Sarkar S., Mukherjee A.
Conference paper, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2021
The success of planetary-scale online collaborative platforms such as Wikipedia hinges on the active and continued participation of their voluntary contributors. The phenomenal success of Wikipedia as a valued multilingual source of information is a testament to the possibilities of collective intelligence. Specifically, the sustained and prudent contributions of experienced, prolific editors have played a crucial role in keeping the platform operating smoothly for decades. However, it has been brought to light that the growth of Wikipedia is stagnating, with the number of editors declining steadily over time. This decreasing productivity and ever-increasing attrition rate among both newcomer and experienced editors is a major concern not only for the future of this platform but also for several industry-scale information retrieval systems, such as Siri and Alexa, which depend on Wikipedia as a knowledge store. In this paper, we study the ongoing crisis in which experienced and prolific editors withdraw. We perform an extensive analysis of editor activities and language usage to identify features that can forecast which prolific Wikipedians are at risk of ceasing their voluntary services. To the best of our knowledge, this is the first work that proposes a scalable prediction pipeline for detecting prolific Wikipedians who might be at risk of retiring from the platform, and that can thereby enable moderators to launch appropriate incentive mechanisms to retain such ‘would-be missing’ valued Wikipedians.
Accessibility metric for characterizing the relevance of conference papers
Das P., Adhikari A., Mukherjee A.
Conference paper, Proceedings of 2019 IEEE Region 10 Symposium, TENSYMP 2019, 2019
An academic conference is a medium for rapid dissemination of knowledge. As the horizon of a conference expands, it often becomes difficult for a paper to achieve proper impact because of topic mismatch. The problem affects both conference organizers and prospective participants. As a result, some works remain disconnected from the bulk of the work presented at the conference. A solution to this situation is sought here by establishing better cohesion. The papers are modelled as nodes of a graph, and two nodes are connected by an edge if the papers share a common keyword specified during submission. An accessibility metric over this network is proposed in this work for judging the relevance of a paper. Two case studies are presented as proof of concept.
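The graph construction described in the abstract above can be sketched as follows. The abstract does not define the accessibility metric itself, so `reachable_fraction` below is a stand-in assumption (the share of other papers reachable through shared-keyword edges), not the authors' metric. Paper IDs and keywords are invented for illustration.

```python
from collections import deque
from itertools import combinations

def build_keyword_graph(papers):
    """Connect two papers with an edge when they share at least one keyword.

    papers maps a paper id to the set of keywords given at submission.
    Returns an adjacency dict: paper id -> set of neighbouring paper ids.
    """
    graph = {pid: set() for pid in papers}
    for a, b in combinations(papers, 2):
        if papers[a] & papers[b]:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def reachable_fraction(graph, start):
    """Stand-in relevance proxy: fraction of other papers reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    others = len(graph) - 1
    return (len(seen) - 1) / others if others else 0.0

# Hypothetical submissions and their keywords.
papers = {
    "P1": {"graph", "wikipedia"},
    "P2": {"graph", "metric"},
    "P3": {"metric"},
    "P4": {"robotics"},
}
graph = build_keyword_graph(papers)
for pid in papers:
    print(pid, sorted(graph[pid]), round(reachable_fraction(graph, pid), 2))
```

In this toy input, P4 shares no keyword with any other paper, so it ends up isolated, which matches the kind of disconnected work the abstract describes.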
Generating a representative keyword subset pertaining to an academic conference series
Adhikari A., Das P., Mukherjee A.
Article, Scientometrics, 2019
The breadth and velocity of innovation have resulted in an explosion of research documents. Academic conferences are being arranged worldwide, most of them at regular intervals, thereby generating a huge volume of research documents. Extracting undiscovered knowledge from conference papers, and thereby finding the inter-relationships among conference research topics, is a challenging task. This paper attempts knowledge discovery for a conference with the help of the keywords mentioned in the papers presented there. The proposed scheme tries to cover the entire set of conference research papers using a small subset of all available keywords. The correctness and complexity of the scheme are analyzed. Proof of concept is established through some flagship conferences held annually around the globe. The performance is favourable when compared with available text mining methods, as far as practicable. Results indicate that the scheme could be useful in characterizing topical themes of academic conferences, which may benefit both participants and organizers.
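Covering every paper with a small subset of keywords, as described in the abstract above, resembles the classic set cover problem. The sketch below uses the standard greedy heuristic for set cover as a stand-in. The abstract does not say which scheme the authors propose, so this is an assumption for illustration, and the paper IDs and keywords are invented.

```python
def greedy_keyword_cover(papers):
    """Pick keywords greedily until every paper has at least one chosen keyword.

    papers maps a paper id to its set of keywords.
    Returns (chosen keywords in pick order, papers that could not be covered).
    """
    # Invert the mapping: keyword -> set of papers that list it.
    by_keyword = {}
    for pid, keywords in papers.items():
        for keyword in keywords:
            by_keyword.setdefault(keyword, set()).add(pid)

    uncovered = set(papers)
    chosen = []
    while uncovered and by_keyword:
        # Take the keyword covering the most still-uncovered papers;
        # sorting first makes tie-breaking deterministic (alphabetical).
        best = max(sorted(by_keyword), key=lambda kw: len(by_keyword[kw] & uncovered))
        gained = by_keyword[best] & uncovered
        if not gained:
            break  # the remaining papers list no keywords at all
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered

# Hypothetical conference papers and their keywords.
papers = {
    "P1": {"nlp", "wikipedia"},
    "P2": {"nlp", "bias"},
    "P3": {"nlp", "graphs"},
    "P4": {"graphs"},
}
chosen, uncovered = greedy_keyword_cover(papers)
print(chosen, uncovered)
```

The greedy heuristic does not guarantee the smallest possible subset, but it is a common baseline for this kind of covering problem.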