Dr Paramita Das

Assistant Professor

Department of Computer Science and Engineering

Contact Details

paramita.d@srmap.edu.in

Office Location

CV Raman Block, Level 11, Cubicle No: 121

Education

  • 2025: PhD, IIT Kharagpur, West Bengal, India
  • 2018: M.Tech, IIEST Shibpur, West Bengal, India
  • 2015: B.Tech, Maulana Abul Kalam Azad University of Technology, West Bengal, India

Experience

  • Research Associate, IIT Kharagpur

Research Interests

  • My research interests lie in Computational Social Science, Natural Language Processing, Deep Learning, and Knowledge Graphs. I focus on developing multimodal AI pipelines for large-scale web corpora from collaborative platforms such as Wikipedia and Wikidata, with applications in automated quality assessment, misinformation detection, and bias analysis.
  • I am currently exploring Responsible AI principles, with an emphasis on understanding and mitigating biases in multimodal large models such as Vision-Language Models (VLMs), and advancing cultural safety alignment techniques for large language models.

Awards & Fellowships

  • Scholarship from the Govt. of India for securing a rank in the 12th standard board examination
  • Qualified UGC-NET for Assistant Professorship

Publications

  • On the effective transfer of knowledge from English to Hindi Wikipedia

    Das P., Roy A., Chakraborty R., Mukherjee A.

    Conference paper, Proceedings - International Conference on Computational Linguistics, COLING, 2025

    Although Wikipedia is the largest multilingual encyclopedia, it remains inherently incomplete. There is a significant disparity in content quality between high-resource languages (HRLs, e.g., English) and low-resource languages (LRLs, e.g., Hindi), with many LRL articles lacking adequate information. To bridge these content gaps, we propose a lightweight framework to enhance knowledge equity between English and Hindi. When the English Wikipedia page is not up to date, our framework extracts relevant information from readily available external resources (such as English books) and adapts it to align with Wikipedia’s distinctive style, including its neutral point of view (NPOV) policy, using the in-context learning capabilities of large language models. The adapted content is then machine-translated into Hindi for integration into the corresponding Wikipedia articles. If, on the other hand, the English version is comprehensive and up to date, the framework directly transfers knowledge from English to Hindi. Our framework effectively generates new content for Hindi Wikipedia sections, enhancing Hindi Wikipedia articles by 65% and 62% according to automatic and human judgment-based evaluations, respectively.
  • REVERSUM: A Multi-staged Retrieval-Augmented Generation Method to Enhance Wikipedia Tail Biographies through Personal Narratives

    Adak S., Meher P.M., Das P., Mukherjee A.

    Conference paper, Proceedings - International Conference on Computational Linguistics, COLING, 2025

    Wikipedia is an invaluable resource for factual information about a wide range of entities. However, the quality of articles on lesser-known entities often lags behind that of well-known ones. This study proposes a novel approach to enhancing Wikipedia’s B- and C-category biography articles by leveraging personal narratives such as autobiographies and biographies. Using a multi-staged retrieval-augmented generation technique, REVERSUM, we aim to enrich the informational content of these lesser-known articles. Our study reveals that personal narratives can significantly improve the quality of Wikipedia articles, providing a rich source of reliable information that has been underutilized in previous studies. Based on crowd-based evaluation, REVERSUM-generated content outperforms the best-performing baseline by 17% in terms of integrability into the original Wikipedia article and by 28.5% in terms of informativeness.
  • Diversity matters: Robustness of bias measurements in Wikidata

    Das P., Karnam S.K., Panda A., Guda B.P.R., Sarkar S., Mukherjee A.

    Conference paper, ACM International Conference Proceeding Series, 2023, DOI Link

    With the widespread use of knowledge graphs (KGs) in various automated AI systems and applications, it is important to ensure that the information retrieval algorithms leveraging them are free from societal biases. Previous works have documented biases that persist in KGs and have employed several metrics for measuring those biases. However, such studies lack a systematic exploration of how sensitive the bias measurements are to the source of data or to the embedding algorithm used. To address this research gap, in this work we present a holistic analysis of bias measurement on knowledge graphs. First, we reveal data biases that surface in Wikidata for thirteen different demographics selected from seven continents. Next, we examine the variance in bias detection across two knowledge graph embedding algorithms, TransE and ComplEx. We conduct extensive experiments on a large number of occupations sampled from the thirteen demographics with respect to the sensitive attribute gender. Our results show that the inherent data bias that persists in a KG can be altered by the algorithmic bias introduced by KG embedding learning algorithms. Further, we show that the choice of state-of-the-art KG embedding algorithm has a strong impact on the ranking of biased occupations irrespective of gender. In particular, we find that ComplEx is more robust to the choice of demographics than TransE. Subsequently, we observe that the similarity of biased occupations across demographics is minimal, reflecting socio-cultural differences around the globe that are often overlooked by coarse-grained approaches working at the aggregate level. We believe that this full-scale audit of the bias measurement pipeline will raise awareness in the community, yield insights into the design choices of both data and algorithms, and caution against the popular dogma of "one-size-fits-all".
  • Mining the online infosphere: A survey

    Adak S., Chakraborty S., Das P., Das M., Dash A., Hazra R., Mathew B., Saha P., Sarkar S., Mukherjee A.

    Review, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2022, DOI Link

    The evolution of Artificial Intelligence (AI)-based systems and applications has pervaded everyday life, producing decisions that have a momentous impact on individuals and society. With the staggering growth of online data, often termed the online infosphere, it has become paramount to monitor the infosphere to ensure social good, since AI-based decisions depend heavily on it. This survey provides a comprehensive review of some of the most important research areas related to the infosphere, focusing on the technical challenges and potential solutions, and outlines some important future directions. We begin by focusing on the collaborative systems that have emerged within the infosphere, with a special thrust on Wikipedia. We then demonstrate how the infosphere has been instrumental in the growth of scientific citations and collaborations, thereby fuelling interdisciplinary research. Finally, we illustrate issues related to the governance of the infosphere, such as tackling (a) rising hateful and abusive behavior and (b) bias and discrimination on different online platforms and in news reporting. This article is categorized under: Application Areas > Internet.
  • Quality Change: Norm or Exception? Measurement, Analysis and Detection of Quality Change in Wikipedia

    Das P., Guda B.P.R., Seelaboyina S.B., Sarkar S., Mukherjee A.

    Article, Proceedings of the ACM on Human-Computer Interaction, 2022, DOI Link

    Wikipedia has become an immensely popular crowd-sourced encyclopedia for disseminating information on numerous topics as subscription-free content. It allows anyone to contribute, so that articles remain comprehensive and updated. To enrich content without compromising standards, the Wikipedia community enumerates a detailed set of guidelines to be followed; based on these, articles are categorized by Wikipedia editors into several quality classes with increasing adherence to the guidelines. This quality assessment task is laborious and demands platform expertise. As a first objective, in this paper we study the evolution of a Wikipedia article with respect to such quality scales; our results show novel, non-intuitive patterns emerging from this exploration. As a second objective, we develop an automated, data-driven approach for detecting the early signals that influence a quality change in an article. We pose this as a change point detection problem, representing an article as a time series of consecutive revisions and encoding every revision by a set of intuitive features; various change point detection algorithms are then used to efficiently and accurately detect future change points. We also perform ablation studies to understand which groups of features are most effective in identifying the change points. To the best of our knowledge, this is the first work that rigorously explores the quality life cycle of English Wikipedia articles from the perspective of quality indicators and provides a novel unsupervised page-level approach to detect quality switches, which can help automate content monitoring in Wikipedia and thus contributes significantly to the CSCW community.
  • When Expertise Gone Missing: Uncovering the Loss of Prolific Contributors in Wikipedia

    Das P., Guda B.P.R., Chakraborty D., Sarkar S., Mukherjee A.

    Conference paper, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2021, DOI Link

    The success of planetary-scale online collaborative platforms such as Wikipedia hinges on the active and continued participation of their voluntary contributors. The phenomenal success of Wikipedia as a valued multilingual source of information is a testament to the possibilities of collective intelligence, and the sustained, prudent contributions of experienced prolific editors have played a crucial role in operating the platform smoothly for decades. However, Wikipedia's growth is stagnating: the number of editors has steadily declined over time. This decreasing productivity and ever-increasing attrition rate among both newcomer and experienced editors is a major concern not only for the future of the platform but also for several industry-scale information retrieval systems, such as Siri and Alexa, that depend on Wikipedia as a knowledge store. In this paper, we study this ongoing crisis in which experienced and prolific editors withdraw. We perform an extensive analysis of editor activities and language usage to identify features that can forecast prolific Wikipedians who are at risk of ceasing their voluntary service. To the best of our knowledge, this is the first work to propose a scalable prediction pipeline for detecting prolific Wikipedians who might be at risk of retiring from the platform, thereby enabling moderators to launch appropriate incentive mechanisms to retain such ‘would-be missing’ valued Wikipedians.
  • Accessibility metric for characterizing the relevance of conference papers

    Das P., Adhikari A., Mukherjee A.

    Conference paper, Proceedings of 2019 IEEE Region 10 Symposium, TENSYMP 2019, 2019, DOI Link

    An academic conference is a medium for the rapid dissemination of knowledge. As the field expands, it often becomes difficult for a paper to achieve proper impact due to topic mismatch, a problem that affects both conference organizers and prospective participants: some works remain disconnected and disoriented from the majority of the work presented at the conference. We seek a solution by establishing better cohesion. Papers are modelled as nodes of a graph and connected by edges if they share a common keyword specified during submission. An accessibility metric over this network is proposed as a means of judging the relevance of a paper, and two case studies are presented as proof of concept.
  • Generating a representative keyword subset pertaining to an academic conference series

    Adhikari A., Das P., Mukherjee A.

    Article, Scientometrics, 2019, DOI Link

    The breadth and velocity of innovation have resulted in an ever-growing explosion of research documents. Academic conferences are arranged worldwide, most at regular intervals, generating a huge volume of research papers. Extracting undiscovered knowledge from conference papers and thereby finding the inter-relationships among conference research topics is a challenging task. This paper attempts knowledge discovery for a conference with the help of the keywords mentioned in the papers presented there. The proposed scheme tries to cover the entire set of conference research papers using a small subset of all available keywords; its correctness and complexity are analyzed. Proof of concept is established through several flagship conferences held annually around the globe, and the performance compares favourably with available text mining methods. The results indicate that the scheme could be useful in characterizing the topical themes of academic conferences, benefiting both participants and organizers.

Interests

  • Computational Social Science
  • Deep Learning
  • Natural Language Processing
  • Responsible AI

Computer Science and Engineering is a fast-evolving discipline and this is an exciting time to become a Computer Scientist!

Computer Science and Engineering is a fast-evolving discipline and this is an exciting time to become a Computer Scientist!

Recent Updates

No recent updates found.

Education
2015
B.Tech
Maulana Abul Kalam Azad University of Technology, West Bengal
India
2018
M.Tech
IIEST Shibpur
India
2025
PhD
IIT Kharagpur
India
Experience
  • Research Associate, IIT Kharagpur
Research Interests
  • My research interests lie in Computational Social Science, Natural Language Processing, Deep Learning, and Knowledge Graphs. I focus on developing multimodal AI pipelines for large-scale web corpora from collaborative platforms such as Wikipedia and Wikidata, with applications in automated quality assessment, misinformation detection, and bias analysis.
  • I am currently exploring Responsible AI principles, with an emphasis on understanding and mitigating biases in multimodal large models such as Vision-Language Models (VLMs), and advancing cultural safety alignment techniques for large language models.
Awards & Fellowships
  • Scholarship by the Govt. of India for securing rank in 12th standard board examination
  • Qualified UGC-NET for Assistant Professor.
Memberships
Publications
  • On the effective transfer of knowledge from English to Hindi Wikipedia

    Das P., Roy A., Chakraborty R., Mukherjee A.

    Conference paper, Proceedings - International Conference on Computational Linguistics, COLING, 2025,

    View abstract ⏷

    Although Wikipedia is the largest multilingual encyclopedia, it remains inherently incomplete. There is a significant disparity in the quality of content between high-resource languages (HRLs, e.g., English) and low-resource languages (LRLs, e.g., Hindi), with many LRL articles lacking adequate information. To bridge these content gaps we propose a lightweight framework to enhance knowledge equity between English and Hindi. In case the English Wikipedia page is not up-to-date, our framework extracts relevant information from external resources readily available (such as English books), and adapts it to align with Wikipedia’s distinctive style, including its neutral point of view (NPOV) policy, using in-context learning capabilities of large language models. The adapted content is then machine-translated into Hindi for integration into the corresponding Wikipedia articles. On the other hand, if the English version is comprehensive and up-to-date the framework directly transfers knowledge from English to Hindi. Our framework effectively generates new content for Hindi Wikipedia sections, enhancing Hindi Wikipedia articles respectively by 65% and 62% according to automatic and human judgment-based evaluations.
  • REVERSUM: A Multi-staged Retrieval-Augmented Generation Method to Enhance Wikipedia Tail Biographies through Personal Narratives

    Adak S., Meher P.M., Das P., Mukherjee A.

    Conference paper, Proceedings - International Conference on Computational Linguistics, COLING, 2025,

    View abstract ⏷

    Wikipedia is an invaluable resource for factual information about a wide range of entities. However, the quality of articles on less-known entities often lags behind that of the well-known ones. This study proposes a novel approach to enhancing Wikipedia’s B and C category biography articles by leveraging personal narratives such as autobiographies and biographies. By utilizing a multi-staged retrieval-augmented generation technique – REVERSUM – we aim to enrich the informational content of these lesser-known articles. Our study reveals that personal narratives can significantly improve the quality of Wikipedia articles, providing a rich source of reliable information that has been underutilized in previous studies. Based on crowd-based evaluation, REVERSUM generated content outperforms the best performing baseline by 17% in terms of integrability to the original Wikipedia article and 28.5% in terms of informativeness.
  • Diversity matters: Robustness of bias measurements in Wikidata

    Das P., Karnam S.K., Panda A., Guda B.P.R., Sarkar S., Mukherjee A.

    Conference paper, ACM International Conference Proceeding Series, 2023, DOI Link

    View abstract ⏷

    With the widespread use of knowledge graphs (KG) in various automated AI systems and applications, it is very important to ensure that information retrieval algorithms leveraging them are free from societal biases. Previous works have depicted biases that persist in KGs, as well as employed several metrics for measuring the biases. However, such studies lack in the systematic exploration of the sensitivity of the bias measurements, through varying sources of data, or the embedding algorithms used. To address this research gap, in this work, we present a holistic analysis of bias measurement on the knowledge graph. First, we attempt to reveal data biases that surface in Wikidata for thirteen different demographics selected from seven continents. Next, we attempt to unfold the variance in the detection of biases by two different knowledge graph embedding algorithms-TransE and ComplEx. We conduct our extensive experiments on a large number of occupations sampled from the thirteen demographics with respect to the sensitive attribute, i.e., gender. Our results show that the inherent data bias that persists in KG can be altered by specific algorithm bias as incorporated by KG embedding learning algorithms. Further, we show that the choice of the state-of-the-art KG embedding algorithm has a strong impact on the ranking of biased occupations irrespective of gender. In particular, we find that the embedding algorithm ComplEx is more robust to the choice of demographics compared to TransE. Subsequently, we observe that the similarity of the biased occupations across demographics is minimal which reflects the socio-cultural differences around the globe. This is often overlooked by most of the coarse-grained approaches working at the aggregate level. 
We believe that this full-scale audit of the bias measurement pipeline will raise awareness among the community while deriving insights related to design choices of data and algorithms both and refrain itself from the popular dogma of "one-size-fits-all".
  • Mining the online infosphere: A survey

    Adak S., Chakraborty S., Das P., Das M., Dash A., Hazra R., Mathew B., Saha P., Sarkar S., Mukherjee A.

    Review, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2022, DOI Link

    View abstract ⏷

    The evolution of Artificial Intelligence (AI)-based systems and applications have pervaded everyday life to make decisions that have a momentous impact on individuals and society. With the staggering growth of online data, often termed as the online infosphere, it has become paramount to monitor the infosphere to ensure social good as AI-based decisions are severely dependent. This survey aims to provide a comprehensive review of some of the most important research areas related to the infosphere, focusing on the technical challenges and potential solutions. The survey also outlines some of the important future directions. We begin by focussing on the collaborative systems that have emerged within the infosphere with a special thrust on Wikipedia. In the follow-up, we demonstrate how the infosphere has been instrumental in the growth of scientific citations and collaborations, thus fuelling interdisciplinary research. Finally, we illustrate the issues related to the governance of the infosphere, such as the tackling of the (a) rising hateful and abusive behavior and (b) bias and discrimination in different online platforms and news reporting. This article is categorized under: Application Areas > Internet.
  • Quality Change: Norm or Exception? Measurement, Analysis and Detection of Quality Change in Wikipedia

    Das P., Guda B.P.R., Seelaboyina S.B., Sarkar S., Mukherjee A.

    Article, Proceedings of the ACM on Human-Computer Interaction, 2022, DOI Link

    View abstract ⏷

    Wikipedia has been turned into an immensely popular crowd-sourced encyclopedia for information dissemination on numerous versatile topics in the form of subscription free content. It allows anyone to contribute so that the articles remain comprehensive and updated. For enrichment of content without compromising standards, the Wikipedia community enumerates a detailed set of guidelines, which should be followed. Based on these, articles are categorized into several quality classes by the Wikipedia editors with increasing adherence to guidelines. This quality assessment task by editors is laborious as well as demands platform expertise. As a first objective, in this paper, we study evolution of a Wikipedia article with respect to such quality scales. Our results show novel non-intuitive patterns emerging from this exploration. As a second objective we attempt to develop an automated data driven approach for the detection of the early signals influencing the quality change of articles. We posit this as a change point detection problem whereby we represent an article as a time series of consecutive revisions and encode every revision by a set of intuitive features. Finally, various change point detection algorithms are used to efficiently and accurately detect the future change points. We also perform various ablation studies to understand which group of features are most effective in identifying the change points. To the best of our knowledge, this is the first work that rigorously explores English Wikipedia article quality life cycle from the perspective of quality indicators and provides a novel unsupervised page level approach to detect quality switch, which can help in automatic content monitoring in Wikipedia thus contributing significantly to the CSCW community.
  • When Expertise Gone Missing: Uncovering the Loss of Prolific Contributors in Wikipedia

    Das P., Guda B.P.R., Chakraborty D., Sarkar S., Mukherjee A.

    Conference paper, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2021, DOI Link

    View abstract ⏷

    Success of planetary-scale online collaborative platforms such as Wikipedia is hinged on active and continued participation of its voluntary contributors. The phenomenal success of Wikipedia as a valued multilingual source of information is a testament to the possibilities of collective intelligence. Specifically, the sustained and prudent contributions by the experienced prolific editors play a crucial role to operate the platform smoothly for decades. However, it has been brought to light that growth of Wikipedia is stagnating in terms of the number of editors that faces steady decline over time. This decreasing productivity and ever increasing attrition rate in both newcomer and experienced editors is a major concern for not only the future of this platform but also for several industry-scale information retrieval systems such as Siri, Alexa which depend on Wikipedia as knowledge store. In this paper, we have studied the ongoing crisis in which experienced and prolific editors withdraw. We performed extensive analysis of the editor activities and their language usage to identify features that can forecast prolific Wikipedians, who are at risk of ceasing voluntary services. To the best of our knowledge, this is the first work which proposes a scalable prediction pipeline, towards detecting the prolific Wikipedians, who might be at a risk of retiring from the platform and, thereby, can potentially enable moderators to launch appropriate incentive mechanisms to retain such ‘would-be missing’ valued Wikipedians.
  • Accessibility metric for characterizing the relevance of conference papers

    Das P., Adhikari A., Mukherjee A.

    Conference paper, Proceedings of 2019 IEEE Region 10 Symposium, TENSYMP 2019, 2019, DOI Link

    View abstract ⏷

    Academic conference is a medium for rapid dissemination of knowledge. In an expanding horizon, often it becomes difficult to achieve proper impact due to topic mismatch. The problem affects either way, both conference organizers as well as the prospective participants. As a result, some of the works remain disconnected and disoriented from the majority volume of work presented in the conference. A solution to such situation is sought here by establishing better cohesion. The papers are modelled as nodes of a graph and these are connected through edges if they share a common keyword, specified during submission. An accessibility metric over this network is proposed in this work as being capable of judging the relevance of a paper. Two case studies are presented for proof of concept.
  • Generating a representative keyword subset pertaining to an academic conference series

    Adhikari A., Das P., Mukherjee A.

    Article, Scientometrics, 2019, DOI Link

    View abstract ⏷

    The breadth and velocity of innovation has resulted in explosion of research documents day by day. Academic conferences are being arranged worldwide, most of them in regular intervals, thereby generating a huge volume of research documents. Extracting undiscovered knowledge from the conference papers and thereby finding the inter-relationship of conference research topics is a challenging task. This paper attempts towards knowledge discovery for the conference with the help of keywords mentioned in the papers presented therein. The scheme proposed here tries to include the entire set of conference research papers using a small subset of all available keywords. The correctness and complexity of the scheme are analyzed. Proof of concept is established through some flagship conference held annually round the globe. The performance is favourable when compared with available text mining methods, as far as practicable. Results indicate that the scheme could be useful in characterizing topical themes of academic conferences, which may benefit both participants and organizers.
Contact Details

paramita.d@srmap.edu.in

Scholars
Interests

  • Computational Social Science
  • Deep Learning
  • Natural Language Processing
  • Responsible AI

Education
2015
B.Tech
Maulana Abul Kalam Azad University of Technology, West Bengal
India
2018
M.Tech
IIEST Shibpur
India
2025
PhD
IIT Kharagpur
India
Experience
  • Research Associate, IIT Kharagpur
Research Interests
  • My research interests lie in Computational Social Science, Natural Language Processing, Deep Learning, and Knowledge Graphs. I focus on developing multimodal AI pipelines for large-scale web corpora from collaborative platforms such as Wikipedia and Wikidata, with applications in automated quality assessment, misinformation detection, and bias analysis.
  • I am currently exploring Responsible AI principles, with an emphasis on understanding and mitigating biases in multimodal large models such as Vision-Language Models (VLMs), and advancing cultural safety alignment techniques for large language models.
Awards & Fellowships
  • Scholarship by the Govt. of India for securing rank in 12th standard board examination
  • Qualified UGC-NET for Assistant Professor.
Memberships
Publications
  • On the effective transfer of knowledge from English to Hindi Wikipedia

    Das P., Roy A., Chakraborty R., Mukherjee A.

    Conference paper, Proceedings - International Conference on Computational Linguistics, COLING, 2025,

    View abstract ⏷

    Although Wikipedia is the largest multilingual encyclopedia, it remains inherently incomplete. There is a significant disparity in the quality of content between high-resource languages (HRLs, e.g., English) and low-resource languages (LRLs, e.g., Hindi), with many LRL articles lacking adequate information. To bridge these content gaps we propose a lightweight framework to enhance knowledge equity between English and Hindi. In case the English Wikipedia page is not up-to-date, our framework extracts relevant information from external resources readily available (such as English books), and adapts it to align with Wikipedia’s distinctive style, including its neutral point of view (NPOV) policy, using in-context learning capabilities of large language models. The adapted content is then machine-translated into Hindi for integration into the corresponding Wikipedia articles. On the other hand, if the English version is comprehensive and up-to-date the framework directly transfers knowledge from English to Hindi. Our framework effectively generates new content for Hindi Wikipedia sections, enhancing Hindi Wikipedia articles respectively by 65% and 62% according to automatic and human judgment-based evaluations.
  • REVERSUM: A Multi-staged Retrieval-Augmented Generation Method to Enhance Wikipedia Tail Biographies through Personal Narratives

    Adak S., Meher P.M., Das P., Mukherjee A.

    Conference paper, Proceedings - International Conference on Computational Linguistics, COLING, 2025,

    View abstract ⏷

    Wikipedia is an invaluable resource for factual information about a wide range of entities. However, the quality of articles on less-known entities often lags behind that of the well-known ones. This study proposes a novel approach to enhancing Wikipedia’s B and C category biography articles by leveraging personal narratives such as autobiographies and biographies. By utilizing a multi-staged retrieval-augmented generation technique – REVERSUM – we aim to enrich the informational content of these lesser-known articles. Our study reveals that personal narratives can significantly improve the quality of Wikipedia articles, providing a rich source of reliable information that has been underutilized in previous studies. Based on crowd-based evaluation, REVERSUM generated content outperforms the best performing baseline by 17% in terms of integrability to the original Wikipedia article and 28.5% in terms of informativeness.
  • Diversity matters: Robustness of bias measurements in Wikidata

    Das P., Karnam S.K., Panda A., Guda B.P.R., Sarkar S., Mukherjee A.

    Conference paper, ACM International Conference Proceeding Series, 2023, DOI Link

    With the widespread use of knowledge graphs (KGs) in various automated AI systems and applications, it is very important to ensure that information retrieval algorithms leveraging them are free from societal biases. Previous works have depicted biases that persist in KGs and have employed several metrics for measuring those biases. However, such studies lack a systematic exploration of the sensitivity of the bias measurements to varying sources of data or to the embedding algorithms used. To address this research gap, in this work we present a holistic analysis of bias measurement on the knowledge graph. First, we attempt to reveal data biases that surface in Wikidata for thirteen different demographics selected from seven continents. Next, we attempt to unfold the variance in the detection of biases by two different knowledge graph embedding algorithms – TransE and ComplEx. We conduct our extensive experiments on a large number of occupations sampled from the thirteen demographics with respect to the sensitive attribute, i.e., gender. Our results show that the inherent data bias that persists in KGs can be altered by specific algorithmic bias as incorporated by KG embedding learning algorithms. Further, we show that the choice of the state-of-the-art KG embedding algorithm has a strong impact on the ranking of biased occupations irrespective of gender. In particular, we find that the embedding algorithm ComplEx is more robust to the choice of demographics compared to TransE. Subsequently, we observe that the similarity of the biased occupations across demographics is minimal, which reflects the socio-cultural differences around the globe. This is often overlooked by most of the coarse-grained approaches working at the aggregate level. We believe that this full-scale audit of the bias measurement pipeline will raise awareness in the community while deriving insights related to the design choices of both data and algorithms, moving away from the popular dogma of "one-size-fits-all".
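    To make the TransE-based bias measurement concrete, the sketch below scores the triples (occupation, has_gender, male) and (occupation, has_gender, female) and takes the difference as a bias signal. The embeddings here are random toy vectors, and all entity and relation names are hypothetical; a real audit would use embeddings trained on Wikidata.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy entity and relation embeddings standing in for a trained TransE model.
entities = {name: rng.normal(size=dim) for name in
            ["male", "female", "engineer", "nurse", "teacher"]}
r_gender = rng.normal(size=dim)  # hypothetical relation "has_gender"

def transe_score(head, rel, tail):
    # TransE plausibility: lower ||h + r - t|| means more plausible,
    # so we negate the norm to get a higher-is-better score.
    return -np.linalg.norm(head + rel - tail)

def gender_bias(occupation):
    # Positive value: (occupation, has_gender, male) scores higher than
    # the female counterpart, i.e. a male-leaning bias in the embedding.
    h = entities[occupation]
    return (transe_score(h, r_gender, entities["male"])
            - transe_score(h, r_gender, entities["female"]))

for occ in ["engineer", "nurse", "teacher"]:
    print(occ, round(float(gender_bias(occ)), 3))
```

    Repeating the same measurement with a second embedding model (e.g. ComplEx, whose scoring function differs) and comparing the resulting occupation rankings is the kind of sensitivity check the paper advocates.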
  • Mining the online infosphere: A survey

    Adak S., Chakraborty S., Das P., Das M., Dash A., Hazra R., Mathew B., Saha P., Sarkar S., Mukherjee A.

    Review, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2022, DOI Link

    The evolution of Artificial Intelligence (AI)-based systems and applications has pervaded everyday life to make decisions that have a momentous impact on individuals and society. With the staggering growth of online data, often termed the online infosphere, it has become paramount to monitor the infosphere to ensure social good, as AI-based decisions are heavily dependent on it. This survey aims to provide a comprehensive review of some of the most important research areas related to the infosphere, focusing on the technical challenges and potential solutions. The survey also outlines some of the important future directions. We begin by focusing on the collaborative systems that have emerged within the infosphere, with a special thrust on Wikipedia. In the follow-up, we demonstrate how the infosphere has been instrumental in the growth of scientific citations and collaborations, thus fuelling interdisciplinary research. Finally, we illustrate the issues related to the governance of the infosphere, such as tackling (a) rising hateful and abusive behavior and (b) bias and discrimination in different online platforms and news reporting. This article is categorized under: Application Areas > Internet.
  • Quality Change: Norm or Exception? Measurement, Analysis and Detection of Quality Change in Wikipedia

    Das P., Guda B.P.R., Seelaboyina S.B., Sarkar S., Mukherjee A.

    Article, Proceedings of the ACM on Human-Computer Interaction, 2022, DOI Link

    Wikipedia has become an immensely popular crowd-sourced encyclopedia for information dissemination on numerous versatile topics in the form of subscription-free content. It allows anyone to contribute so that the articles remain comprehensive and updated. For enrichment of content without compromising standards, the Wikipedia community enumerates a detailed set of guidelines that should be followed. Based on these, articles are categorized into several quality classes by the Wikipedia editors with increasing adherence to guidelines. This quality assessment task by editors is laborious and demands platform expertise. As a first objective, in this paper we study the evolution of a Wikipedia article with respect to such quality scales. Our results show novel, non-intuitive patterns emerging from this exploration. As a second objective, we attempt to develop an automated, data-driven approach for the detection of the early signals influencing the quality change of articles. We pose this as a change point detection problem whereby we represent an article as a time series of consecutive revisions and encode every revision by a set of intuitive features. Finally, various change point detection algorithms are used to efficiently and accurately detect the future change points. We also perform various ablation studies to understand which groups of features are most effective in identifying the change points. To the best of our knowledge, this is the first work that rigorously explores the English Wikipedia article quality life cycle from the perspective of quality indicators and provides a novel unsupervised page-level approach to detect quality switches, which can help in automatic content monitoring in Wikipedia, thus contributing significantly to the CSCW community.
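    The change point formulation above can be illustrated with a minimal single mean-shift detector over a per-revision feature series. The "quality signal" below is synthetic, and this exhaustive-split detector is a deliberate simplification of the algorithms the paper actually evaluates.

```python
import numpy as np

def detect_change_point(series):
    """Return the index that best splits the series into two segments
    with different means (a minimal single change-point detector)."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    best_idx, best_gain = None, -np.inf
    total_cost = np.sum((series - series.mean()) ** 2)
    for k in range(2, n - 1):
        left, right = series[:k], series[k:]
        cost = (np.sum((left - left.mean()) ** 2)
                + np.sum((right - right.mean()) ** 2))
        gain = total_cost - cost  # variance explained by the split
        if gain > best_gain:
            best_idx, best_gain = k, gain
    return best_idx

# Toy per-revision quality signal: a jump after revision 30.
signal = np.concatenate([np.full(30, 0.2), np.full(20, 0.8)])
signal += np.random.default_rng(1).normal(0, 0.05, signal.shape)
print(detect_change_point(signal))  # expected near 30
```

    In the paper's setting, each point of the series would be a vector of intuitive revision features rather than a single scalar, and more sophisticated detectors would replace the exhaustive split search.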
  • When Expertise Gone Missing: Uncovering the Loss of Prolific Contributors in Wikipedia

    Das P., Guda B.P.R., Chakraborty D., Sarkar S., Mukherjee A.

    Conference paper, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2021, DOI Link

    The success of planetary-scale online collaborative platforms such as Wikipedia hinges on the active and continued participation of their voluntary contributors. The phenomenal success of Wikipedia as a valued multilingual source of information is a testament to the possibilities of collective intelligence. Specifically, the sustained and prudent contributions by experienced, prolific editors play a crucial role in operating the platform smoothly for decades. However, it has been brought to light that the growth of Wikipedia is stagnating: the number of editors faces a steady decline over time. This decreasing productivity and ever-increasing attrition rate among both newcomer and experienced editors is a major concern not only for the future of this platform but also for several industry-scale information retrieval systems, such as Siri and Alexa, which depend on Wikipedia as a knowledge store. In this paper, we have studied the ongoing crisis in which experienced and prolific editors withdraw. We performed extensive analysis of editor activities and their language usage to identify features that can forecast prolific Wikipedians who are at risk of ceasing voluntary service. To the best of our knowledge, this is the first work that proposes a scalable prediction pipeline for detecting prolific Wikipedians who might be at risk of retiring from the platform, thereby potentially enabling moderators to launch appropriate incentive mechanisms to retain such ‘would-be missing’ valued Wikipedians.
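    A toy version of one such activity-based signal is sketched below: an editor is flagged as at-risk when their recent edit rate drops sharply relative to their historical rate. The window sizes, threshold, and edit counts are all hypothetical; the paper's pipeline learns from many activity and language features rather than one heuristic.

```python
# Illustrative at-risk heuristic over monthly edit counts (toy numbers).
def at_risk(edit_counts_per_month, recent_window=3, drop_ratio=0.5):
    """Flag an editor whose recent edit rate fell below drop_ratio
    times their historical average rate."""
    history = edit_counts_per_month[:-recent_window]
    recent = edit_counts_per_month[-recent_window:]
    if not history:
        return False
    hist_rate = sum(history) / len(history)
    recent_rate = sum(recent) / len(recent)
    return hist_rate > 0 and recent_rate < drop_ratio * hist_rate

print(at_risk([40, 35, 50, 45, 10, 5, 2]))    # → True  (sharp decline)
print(at_risk([40, 35, 50, 45, 42, 38, 41]))  # → False (steady activity)
```

    In the paper's setup, such engineered features feed a trained classifier instead of a fixed threshold, which is what makes the pipeline scalable across the editor population.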
  • Accessibility metric for characterizing the relevance of conference papers

    Das P., Adhikari A., Mukherjee A.

    Conference paper, Proceedings of 2019 IEEE Region 10 Symposium, TENSYMP 2019, 2019, DOI Link

    An academic conference is a medium for the rapid dissemination of knowledge. In an expanding horizon, it often becomes difficult to achieve proper impact due to topic mismatch. The problem affects both conference organizers and prospective participants. As a result, some works remain disconnected and disoriented from the majority of work presented at the conference. A solution is sought here by establishing better cohesion. The papers are modelled as nodes of a graph, connected by edges if they share a common keyword specified during submission. An accessibility metric over this network is proposed in this work as being capable of judging the relevance of a paper. Two case studies are presented as proof of concept.
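    The graph construction can be sketched as follows. The papers and keywords are fabricated, and the reachability-based relevance proxy shown here is only illustrative, not the accessibility metric defined in the paper.

```python
from itertools import combinations

# Hypothetical papers with their submission keywords.
papers = {
    "P1": {"graph mining", "networks"},
    "P2": {"networks", "NLP"},
    "P3": {"NLP", "transformers"},
    "P4": {"quantum computing"},
}

# Connect two papers with an edge if they share at least one keyword.
edges = {p: set() for p in papers}
for a, b in combinations(papers, 2):
    if papers[a] & papers[b]:
        edges[a].add(b)
        edges[b].add(a)

def accessibility(start):
    # Simple relevance proxy: fraction of other papers reachable
    # from this paper via shared-keyword edges (BFS).
    seen, frontier = {start}, [start]
    while frontier:
        nxt = [v for u in frontier for v in edges[u] if v not in seen]
        seen.update(nxt)
        frontier = nxt
    return (len(seen) - 1) / (len(papers) - 1)

for p in papers:
    print(p, accessibility(p))
```

    In this toy instance, P4 shares no keyword with any other paper, so its score is 0, capturing exactly the kind of topically disconnected submission the abstract describes.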
  • Generating a representative keyword subset pertaining to an academic conference series

    Adhikari A., Das P., Mukherjee A.

    Article, Scientometrics, 2019, DOI Link

    The breadth and velocity of innovation have resulted in an explosion of research documents. Academic conferences are arranged worldwide, most at regular intervals, thereby generating a huge volume of research documents. Extracting undiscovered knowledge from conference papers and thereby finding the inter-relationships of conference research topics is a challenging task. This paper attempts knowledge discovery for a conference with the help of the keywords mentioned in the papers presented therein. The scheme proposed here tries to cover the entire set of conference research papers using a small subset of all available keywords. The correctness and complexity of the scheme are analyzed. Proof of concept is established through some flagship conferences held annually around the globe. The performance is favourable when compared with available text mining methods, as far as practicable. Results indicate that the scheme could be useful in characterizing the topical themes of academic conferences, which may benefit both participants and organizers.
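    Selecting a small keyword subset that covers every paper maps naturally onto greedy set cover, sketched below. The papers and keywords are fabricated, and greedy selection is one plausible reading of the scheme, not necessarily the paper's exact algorithm.

```python
# Hypothetical papers with their submission keywords.
papers = {
    "P1": {"graph mining", "networks"},
    "P2": {"networks", "NLP"},
    "P3": {"NLP", "transformers"},
    "P4": {"quantum computing"},
}

def representative_keywords(papers):
    """Greedy set cover: repeatedly pick the keyword covering the most
    still-uncovered papers until every paper is represented."""
    keyword_to_papers = {}
    for pid, kws in papers.items():
        for kw in kws:
            keyword_to_papers.setdefault(kw, set()).add(pid)
    uncovered, chosen = set(papers), []
    while uncovered:
        best = max(keyword_to_papers,
                   key=lambda k: len(keyword_to_papers[k] & uncovered))
        chosen.append(best)
        uncovered -= keyword_to_papers[best]
    return chosen

print(representative_keywords(papers))
```

    Greedy set cover does not guarantee a minimum subset, but it carries the classical logarithmic approximation guarantee, which is often adequate for characterizing a conference's topical themes.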