Deep Learning-Based Intrusion Detection in IoT: A Cross-Model Performance Analysis
Bhattacharya S., Talukdar A., Mondal K.C.
Conference paper, Lecture Notes in Computer Science, 2026, DOI Link
The proliferation of Internet of Things (IoT) networks has increased exposure to sophisticated cyberattacks, demanding robust intrusion detection systems (IDS). This paper presents a unified, cross-model performance analysis of five deep learning (DL) frameworks—Transformer-based IDS, Graph Neural Networks (GNN), Conditional Variational Autoencoder (CTVAE), SimCLR-based contrastive learning, and Federated MLP—under a common preprocessing and evaluation protocol. Experiments on two recent real-world datasets (RT-IoT 2022 and ACI IoT 2023) are used to evaluate the performance metrics. Transformer and SimCLR models achieve up to 99% accuracy on RT-IoT 2022, while Federated MLP excels on ACI IoT 2023, highlighting deployment trade-offs between centralized and privacy-preserving settings. We further discuss computational efficiency, interpretability considerations, and practical deployment guidance, and outline directions for hybrid and lightweight DL on edge devices. Unlike prior works that evaluate single-model IDS frameworks, this study provides the first unified cross-model comparison integrating both centralized and decentralized deep learning paradigms for IoT intrusion detection.
Association of IUCN-threatened Indian mangroves: A novel data-driven rule filtering approach for restoration strategy
Ghosh M., Mondal S., Fajriyah R., Mondal K.C., Roy A.
Article, Ecological Informatics, 2025, DOI Link
Restoring biodiversity is crucial for ecological sustainability. This study introduces a novel data-driven rule-filtering framework that adopts domain knowledge of taxonomic distinctness and proposes a new metric, total taxonomic distinctness, to prioritize species selection for targeted restoration efforts. We extract and validate association rules to identify frequently co-occurring species and rank them based on total taxonomic distinctness. This structured approach ensures the selection of ecologically significant species that enhance biodiversity and ecosystem resilience. We apply this three-stage framework to Indian mangrove ecosystems, focusing on four IUCN Red List species: Heritiera fomes, Sonneratia griffithii, Ceriops decandra, and Phoenix paludosa. Our results indicate that taxonomically distinct species tend to co-occur more frequently, enhancing ecosystem resilience. Statistical validation using multiple hypothesis testing ensures the robustness of our findings. To assess the framework's broader applicability, we extend our analysis to species presence-absence data from sacred groves in the laterite regions of eastern India. The results reinforce our previous findings, demonstrating frequent association patterns among taxonomically distinct species. This study provides actionable insights for ecological restoration, guiding species selection and co-planting strategies. The framework is adaptable across ecosystems, offering a scalable approach to biodiversity conservation.
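The paper's rule-filtering pipeline and its total-taxonomic-distinctness metric are not reproduced here; as an illustration of the association-rule step it builds on, a minimal support/confidence sketch over hypothetical presence-absence records (the species names are reused only as labels, not as real occurrence data) might look like:

```python
from itertools import combinations

# Hypothetical presence-absence records: each site lists the species observed there.
sites = [
    {"H. fomes", "C. decandra", "P. paludosa"},
    {"H. fomes", "C. decandra"},
    {"S. griffithii", "P. paludosa"},
    {"H. fomes", "C. decandra", "S. griffithii"},
]

def support(itemset, sites):
    """Fraction of sites containing every species in `itemset`."""
    return sum(itemset <= s for s in sites) / len(sites)

def rules(sites, min_support=0.5, min_confidence=0.8):
    """Enumerate pairwise rules A -> B passing the support/confidence filters."""
    species = set().union(*sites)
    out = []
    for a, b in combinations(sorted(species), 2):
        for ante, cons in ((a, b), (b, a)):
            sup = support({ante, cons}, sites)
            if sup >= min_support:
                conf = sup / support({ante}, sites)
                if conf >= min_confidence:
                    out.append((ante, cons, sup, conf))
    return out

for ante, cons, sup, conf in rules(sites):
    print(f"{ante} -> {cons}  support={sup:.2f} confidence={conf:.2f}")
```

In the framework described above, the surviving rules would then be ranked by total taxonomic distinctness rather than by confidence alone.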
Automated credit assessment framework using ETL process and machine learning
Article, Innovations in Systems and Software Engineering, 2025, DOI Link
In the current business scenario, real-time analysis of enterprise data through Business Intelligence (BI) is crucial for supporting operational activities and making strategic decisions. An automated ETL (extraction, transformation, and load) process ensures data ingestion into the data warehouse in near real time, and insights are generated through the BI process based on real-time data. In this paper, we concentrate on automated credit risk assessment in the financial domain based on a machine learning approach. Machine learning-based classification techniques can furnish a self-regulating process to categorize data, and establishing an automated credit decision-making system helps a lending institution manage risk, increase operational efficiency, and comply with regulators. An empirical approach is taken for credit risk assessment using logistic regression and neural network classification methods in compliance with Basel II standards; here, Basel II standards are adopted to calculate the expected loss. The data integration required for building the machine learning models is done through an automated ETL process. We conclude this research work by evaluating the new methodology for credit risk assessment.
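The abstract mentions calculating expected loss under Basel II; the standard formula multiplies the probability of default (PD), the loss given default (LGD), and the exposure at default (EAD):

```python
def expected_loss(pd_, lgd, ead):
    """Basel II expected loss: EL = PD * LGD * EAD.

    pd_ : probability of default (0..1)
    lgd : loss given default, fraction of exposure lost on default (0..1)
    ead : exposure at default, in currency units
    """
    return pd_ * lgd * ead

# Illustrative constants only: 2% default probability, 45% loss given
# default, 100,000 exposure.
print(round(expected_loss(0.02, 0.45, 100_000), 2))
```

In the paper's setting, PD would come from the logistic regression or neural network classifier rather than being a fixed constant as in this sketch.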
Introducing suffix forest for mining tri-clusters from time-series data
Mondal K.C., Ghosh M., Fajriyah R., Roy A.
Article, Innovations in Systems and Software Engineering, 2024, DOI Link
Three-dimensional data is becoming more prevalent these days, and unsupervised data analysis can be used to find hypothesized patterns of interest in it. Clustering can group observations along a single dimension, but its usefulness is limited in three-dimensional domains, where observations are significantly connected within subspaces of the overall space. Bi-clustering addresses the issue of subspace clustering but ignores the third dimension. To deal with this, tri-clustering, the identification of coherent subspaces within three-dimensional data, was introduced and has been extensively studied. Despite the wide range of contributions to this topic, there is still room for improvement in terms of a more structured view of tri-clustering, the extraction of multiple cluster forms (e.g., row-major clusters, regular and irregular clusters), and improved algorithmic techniques. This paper introduces a novel data structure, the suffix forest, to design a tri-clustering algorithm. The algorithm is applied to the Indian Forest Dataset published by the Forest Survey of India, where the tri-clustering concept yields an informative structure in which changes in forest cover and mangrove cover over time are monitored across different states and union territories. This study may be a pioneer for research on biodiversity data analysis; exploring the relationships of different biodiversity traits with respect to both time and geographical region is one of our future research directions.
Comparative Analysis of Object-Based Big Data Storage Systems on Architectures and Services: A Recent Survey
Mondal A.S., Sanyal M., Barua H.B., Chattopadhyay S., Mondal K.C.
Review, Journal of The Institution of Engineers (India): Series B, 2024, DOI Link
Object storage systems are a flexible class of storage where data are organized and stored in units called objects. Objects may store structured, semi-structured, or unstructured data, so they are well suited for big data in cloud environments. Object storage was introduced for academic purposes at Carnegie Mellon University and the University of California, Berkeley; until then, large-scale file systems were the standard practice for search-engine workloads. The earliest instance of object storage used in an organization was EMC's Centera platform (2001). This article supports industry and research efforts to understand the storage systems available for big data: our analysis classifies object-based storage systems and comparatively studies their architectures, helping to distinguish the basic characteristics of object-based storage systems with respect to their implementation. Furthermore, it discusses future perspectives, research challenges, and limitations.
Non-exhaust particulate pollution in Asian countries: A comprehensive review of sources, composition, and health effects
Roy A., Mandal M., Das S., Kumar M., Popek R., Awasthi A., Giri B.S., Mondal K.C., Sarkar A.
Review, Environmental Engineering Research, 2024, DOI Link
Recent regulations on exhaust emissions have led to an increase in non-exhaust emissions, which now surpass exhaust emissions. Non-exhaust emissions are mainly generated from brake and tire particle abrasion, road wear, and re-suspended road dust. In Asia, non-exhaust emissions have increased significantly over the past 50 years, resulting in almost 92% of the population breathing polluted air, which accounts for 70% of air-pollution-related deaths. Most Asian countries with poor air quality are developing or underdeveloped. Taking this into consideration, the current study aims to shed light on particulate pollution from non-exhaust emissions in the Asian context, to assess the current status and its health consequences, and to provide technological solutions. The study is based on an in-depth analysis of existing reviews and research concerning non-exhaust emissions and their health impacts in Asia to pinpoint knowledge gaps. The study found that particulate pollutants exceeded WHO standards in many Asian countries, bringing deleterious health consequences among children and the elderly. The findings underscore the significance of future research efforts to devise solutions that curtail non-exhaust emissions, ultimately reducing air pollution, augmenting air quality, fostering better health outcomes, and paving the way for a more sustainable future before it is too late.
Challenges and solutions of real-time data integration techniques by ETL application
Biswas N., Biswas S., Mondal K.C., Maiti S.
Book chapter, Big Data Analytics Techniques for Market Intelligence, 2024, DOI Link
Business organizations are shifting their focus from the traditional extract-transform-load (ETL) system towards real-time implementation of the ETL process. The traditional ETL process loads new data into the data warehouse (DW) at predefined time intervals, while the DW is offline. Modern organizations want to capture and respond to business events faster than ever, and accessing fresh data is not possible with traditional ETL. Real-time ETL can reflect fresh data in the warehouse immediately when an event occurs in the operational data store. Therefore, the key tool for business lies in a real-time enterprise DW enabled with Business Intelligence. This study provides an overview of the ETL process and its evolution towards real-time ETL. The chapter presents real-time ETL characteristics, its technical challenges, and some popular real-time ETL implementation methods. Finally, some future directions for real-time ETL technology are explored.
Regression Analysis for Finding Correlation on Indian Agricultural Data
Hazra S., Mondal K.C.
Conference paper, Communications in Computer and Information Science, 2024, DOI Link
Food scarcity will be a threatening problem for global civilization due to the huge growth in the world population and the reduction in the world's agricultural land cover. Agriculture depends on several factors, such as climate, soil conditions, irrigation, fertilization, and the condition of pests. The increase in carbon footprint due to civilization adversely affects the worldwide climate, causing unexpected floods, droughts, and increases in pests that directly affect the productivity and quality of agricultural products. Productivity in the agricultural sector can be increased by analyzing and predicting external parameters such as carbon footprint, rainfall, moisture, and soil information, and by forecasting floods, droughts, pest movement, and other factors. In this article, we predict rainfall and carbon footprint, and use regression analysis to find the correlation within Indian agricultural data containing carbon footprint and rainfall across Indian geography, which can help increase Indian agricultural output.
An Introduction to KDB: Knowledge Discovery in Biodiversity
Ghosh M., Mondal S., Roy A., Mondal K.C.
Conference paper, Communications in Computer and Information Science, 2024, DOI Link
The most basic way to experiment with data mining algorithms is the command prompt. Interactive graphical user interfaces offer a more convenient approach to data exploration and to building up complex studies, giving experimental data mining an upgraded status. This article presents an innovative proposal for employing data mining methodology on biodiversity data through KDB (Knowledge Discovery in Biodiversity). KDB provides a platform for domain researchers to apply domain-specific data mining algorithms to their datasets for further analysis, with a convenient interactive graphical user interface for data exploration in the biodiversity domain. The proposed data mining methods are developed in Java, while the website is built in PHP.
An irregular CLA-based novel frequent pattern mining approach
Ghosh M., Mondal S., Moondra H., Utari D.T., Roy A., Mondal K.C.
Article, International Journal of Data Mining, Modelling and Management, 2024, DOI Link
Frequent itemset mining has received a lot of attention in the field of data mining. Its main objective is to find groups of items that consistently appear together in datasets. While frequent itemset mining is useful, its algorithms have quite high resource requirements, and a few improvements have been made in recent years to optimize time and memory needs. This study proposes CellFPM, a straightforward yet effective cellular learning automata-based method for finding frequent itemset occurrences that works efficiently with large datasets. The efficiency of the proposed approach in time and memory requirements has been evaluated using benchmark datasets explicitly designed for performance measurement; the varying size and density of the test datasets confirm the scalability of the suggested method. The findings show that CellFPM consistently surpasses the leading algorithms in runtime and, especially, memory usage.
Frequent itemset mining using FP-tree: a CLA-based approach and its extended application in biodiversity data
Ghosh M., Roy A., Sil P., Mondal K.C.
Article, Innovations in Systems and Software Engineering, 2023, DOI Link
The efficient discovery of frequent itemsets from a transaction database is the fundamental step for association rule mining in data analytics. Interesting associations among the items present in a transaction database contribute to knowledge enrichment, making decision-making and pattern generation from massive amounts of data effortless. One of the major problems with frequent itemset mining algorithms, however, is excessive memory requirements, which makes them unsuitable for larger datasets whose itemsets have high cardinality. A few novel data structures for mining frequent itemsets have been introduced in recent years: N-List, NodeSet, DiffNodeSet, the proximity list, and others offer a coherent mining approach that improves execution time while leaving scope for further improvements in memory requirements. In this paper, we propose a novel algorithm using cellular learning automata (CLA) and multiple FP-tree structures for frequent itemset mining that is efficient in both time and memory requirements. Extensive experimentation has been performed, comparing the proposed method with the leading algorithms on publicly available real and synthetic datasets designed specifically for pattern mining algorithms. We conclude that the proposed method is memory-efficient and shows comparable execution time across varying dataset dimensions and density, assuring its robustness. In addition to the new methodology for frequent itemset mining, its potential domain-specific use in species biodiversity data analysis is also discussed. Which groups of species are closely related can be derived from huge species occurrence records. This could help in understanding species co-occurrence across multiple sites, which in turn assists in solving ecology-related issues for afforestation and reforestation. It could be a step forward toward the advantageous use of computer science in the biodiversity domain.
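The CLA-with-multiple-FP-trees algorithm itself is not reproduced here; as a baseline for what frequent itemset mining computes, a classic level-wise Apriori sketch in plain Python (illustrative, far less efficient than FP-tree approaches) looks like:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining (classic Apriori, for illustration).

    transactions : list of sets of items
    min_support  : minimum fraction of transactions an itemset must appear in
    Returns {frozenset: support} for every frequent itemset.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = {}
    # Level 1: frequent single items.
    level = []
    for it in items:
        sup = sum(it in t for t in transactions) / n
        if sup >= min_support:
            fs = frozenset([it])
            frequent[fs] = sup
            level.append(fs)
    k = 2
    while level:
        # Candidate generation: join pairs of frequent (k-1)-itemsets.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        level = []
        for c in candidates:
            sup = sum(c <= t for t in transactions) / n
            if sup >= min_support:
                frequent[c] = sup
                level.append(c)
        k += 1
    return frequent

tx = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
freq = apriori(tx, min_support=0.6)
print(sorted(("".join(sorted(s)), v) for s, v in freq.items()))
```

FP-tree methods avoid Apriori's repeated database scans and candidate explosion, which is the performance gap the paper's CLA-based approach targets.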
Data Integration Process Automation Using Machine Learning: Issues and Solution
Mondal K.C., Saha S.
Book chapter, Machine Learning for Data Science Handbook: Data Mining and Knowledge Discovery Handbook, Third Edition, 2023, DOI Link
In today’s data-driven world, real-time analysis of enterprise data plays an important role in helping organizations make strategic decisions and improve business operations. Making data available in real time and analyzing it instantly is becoming a challenge for most organizations, and outdated data do not add any value. A company needs reliable, minute-to-minute information to improve operational efficiency and make better, proactive business decisions. Typically, running a data warehouse in an enterprise requires coordinating many operations across multiple teams, along with a lot of error-prone manual intervention, and executing all related steps in the correct sequence under the correct conditions can be a challenge. Automated data integration, specifically the ETL (Extract-Transform-Load) process, addresses these problems, and improving ETL data flows can provide a better return on business investment. Since data across multiple systems are integrated into the data warehouse (DWH), quality issues in the integrated data can generate inaccurate analytics; hence, data need to be pre-processed and optimized for the business intelligence process. The automated ETL process can thus address the availability and quality issues of the traditional data warehouse. Here, the solution approach of the automated ETL process is explained, which supports continuous integration. It also describes how machine learning can be leveraged in the ETL process so that the quality and availability of data are never compromised.
Cloud Big Data Mining and Analytics: Bringing Greenness and Acceleration in the Cloud
Barua H.B., Mondal K.C.
Book chapter, Machine Learning for Data Science Handbook: Data Mining and Knowledge Discovery Handbook, Third Edition, 2023, DOI Link
Big data has gained overwhelming attention over the last decade, and almost all fields of science and technology have experienced a considerable impact from it. The cloud computing paradigm has been targeted for processing and mining big data more efficiently, using the plethora of resources available, from computing nodes to efficient storage. Cloud data mining introduces the concept of performing data mining and analytics of huge data in the cloud, availing of cloud resources. But can we do better? Yes, of course! The main contribution of this chapter is the identification of four game-changing technologies for accelerating the computation and analysis of data mining tasks in the cloud. Graphics Processing Units can be used to further accelerate the mining or analytic process, which is called GPU-accelerated analytics. Approximate Computing can also be introduced in big data analytics to bring efficacy to the process by reducing time and energy, and hence facilitating greenness in the entire computing process. Quantum Computing is a paradigm gaining pace in recent times that can also facilitate efficient and fast big data analytics in very little time. We have surveyed these three technologies and established their importance in big data mining, with a holistic architecture combining these three game-changers from the perspective of big data. We have also discussed another future technology, Neural Processing Units (neural accelerators), for researchers to explore. A brief explanation of big data and cloud data mining concepts is also presented here.
Green Computing for Big Data and Machine Learning
Barua H.B., Mondal K.C., Khatua S.
Conference paper, ACM International Conference Proceeding Series, 2022, DOI Link
The current decade has beheld a tremendous spike in data volume, velocity, variety, and many other such aspects, which we call Big Data and which gave birth to a new kind of science commonly known as "Data Science". With the "Data Apocalypse" in progress, it is evident that conventional methods to handle these data will not suffice. We need distributed and parallel architectures like cloud services (IaaS, PaaS, SaaS, STaaS, etc.). But is that enough to satisfy our needs? Here, we propose a tutorial in a very different direction for Data Science: bringing greenness into Big Data and Machine Learning (ML). We divide the tutorial into two parts, assuming that we are using a cloud backbone for analytic and prediction tasks. The first part covers the techniques and tools to bring energy efficiency/greenness to the algorithmic and code level for Big Data and ML using Approximate Computing. The second part covers the green techniques and power models at the infrastructural level for the cloud.
Determining Dark Diversity of Different Faunal Groups in Indian Estuarine Ecosystem: A New Approach with Computational Biodiversity
Ghosh M., Roy A., Mondal K.C.
Conference paper, Lecture Notes in Networks and Systems, 2022, DOI Link
Computational biodiversity can broadly be understood as the application of computational approaches to exploring, interpreting, and analyzing biodiversity data. The enormous and growing load of biodiversity data needs algorithmic care for accurate data management, hence the term computational biodiversity. Instead of relying purely on presence data, a probabilistic forecast of member distribution, including regions of non-occurrence, can counteract biodiversity loss by restoring potential ecosystems. This paper aims to reveal the perspective of computational biodiversity as a countermeasure against biodiversity loss by correlating it with the concept of dark diversity. The computation of dark diversity is accompanied by a data mining algorithm for establishing rules better suited to managing the depletion of biodiversity. We generate a dataset for the Indian estuarine ecosystem and demonstrate the use of our approach, ending up with rules worthwhile for ecologists. These would help reinforce biological diversity by introducing or rehabilitating specific faunal groups in an estuary under survey.
Knowledge Discovery of Sundarban Mangrove Species: A Way Forward for Managing Species Biodiversity
Ghosh M., Roy A., Mondal K.C.
Article, SN Computer Science, 2022, DOI Link
The mangrove ecosystem is continuously losing its integrity. A few studies have focused on understanding the changing behavior of the Sundarban Mangrove Forest; however, knowledge-based database interpretation and employable pattern extraction may be an efficient approach to countering the degradation of the mangrove ecosystem. Comprehending the gravity of the present scenario, the main contribution of this paper lies in information retrieval for assessing the natural growth of native mangrove species of Sundarban. Our methodology uses association rule mining and biclustering approaches to provide an off-the-shelf mechanism for analyzing the data. It uncovers rules showing the effects of soil pH and water salinity on mangrove community structure and on individual mangrove species, and relates them to biodiversity indices. The rules can predict probable sites for mangrove species expansion by computing the probability of introducing a new species to a particular site. Our study also generates lists of frequently co-occurring species along with the supporting sites. This could help in mangrove ecosystem restoration by identifying the most probable species missing from a particular site, perhaps due to gradual historical disappearance. Hence, this analytical study would enhance the possibilities of restoring the mangrove ecosystem under survey in a systematic and empirical way.
Analysis of Indian Estuarine Data of Flora & Fauna
Ghosh M., Roy A., Mondal K.C.
Conference paper, Lecture Notes in Networks and Systems, 2022, DOI Link
Estuaries represent the transitional ecosystem between freshwater and marine environments. Dominated by both kinds of aquatic realms, they offer one of the most diverse ecosystems. However, Indian estuaries need a more exhaustive survey for the proper management of the wetlands, as the estuarine ecological niche of flora and fauna is at risk. Anthropogenic movements, including trading, industrial, and recreational activities, are the main underlying reasons behind the deteriorating estuarine ecosystem and biodiversity. Comprehending the importance of the estuarine ecosystem, this article concentrates on knowledge discovery from Indian estuarine data of flora and fauna. Here, we show the efficient use of a combined approach of bi-clustering and association rule mining on a manually curated real dataset. We come up with a set of rules presentable to ecologists, summarizing lists of closely co-occurring members, predicted lists of sites for member expansion, and so on. Hence, our study would assist in reinforcing estuarine diversity and could pioneer region-based further studies.
Integration of ETL in Cloud Using Spark for Streaming Data
Biswas N., Mondal K.C.
Conference paper, Lecture Notes in Networks and Systems, 2022, DOI Link
Extract-Transform-Load (ETL) consists of a series of processes that collect raw transactional data and reshape it into clean information that is actionable by Business Intelligence. Presently, most organizations are considering moving towards cloud-based implementations of their mission-critical applications, a trend that also affects the management of ETL processes in the data warehouse environment. The limitations of the traditional ETL process and the benefits of moving ETL into the cloud are discussed in this paper. Challenges in cloud computing adoption with respect to the ETL process are then identified, and the features offered by some leading cloud-enabled ETL solutions are summarized with brief analysis. The paper also covers general issues in cloud ETL from the perspectives of both cloud consumers and service providers. A novel framework is designed to process streaming data coming from real-time data feeds; the solution facilitates the rapid development and deployment of real-time ETL applications.
Recognition of co-existence pattern of salt marshes and mangroves for littoral forest restoration
Ghosh M., Mondal K.C., Roy A.
Article, Ecological Informatics, 2022, DOI Link
Climate-change-driven sea level rise causes an increase in salinity in coastal wetlands, accelerating the alteration of species composition. It triggers the gradual extinction of species, particularly the mangrove population, which is intolerant of excessive salinity. Thus, despite being crucial to a wide range of ecosystem services, mangroves have been identified as a vulnerable coastal biome, and mangrove restoration strategy is undergoing rigorous interdisciplinary research and experimentation in the literature. From a data-driven perspective, analysis of mangrove occurrence data could be the key to comprehending and predicting mangrove behavior along different environmental parameters, and could be important in formulating management strategies for mangrove rehabilitation and restoration. As salt marshes are natural salt-accumulating halophytes, mitigating excessive salinity could be achieved by incorporating salt marshes into mangrove restoration activities. This study intends to find a novel restoration strategy by assessing the frequent co-existence of salt marshes with mangroves and mangrove associates in different zones of degraded mangrove patches, for species-rich plantation. To achieve this, we design a novel methodological framework for knowledge discovery concerning the coexistence pattern of salt marshes, mangroves, and mangrove associates along with environmental parameters, using the data mining paradigm of association rule mining. The proposed approach can uncover underlying facts and forecast likely ones, which could automate ecological research into the occurrence of inter-species relationships. Our findings are based on published data gathered on the Sundarban Mangrove Forest, one of the world's most important littoral forests.
The existing literature reinforces our findings, which include all the sets of frequently co-occurring mangroves, their associates, and salt marshes along the salinity gradient of coastal Sundarbans. A detailed understanding of the occurrence patterns of all these, along with the environmental variables, would promote decision-making strategy. This framework is effective for both academia and stakeholders, especially foresters and conservation planners, in regulating the spread of salt marshes and restoring mangroves.
Finding Prediction of Interaction Between SARS-CoV-2 and Human Protein: A Data-Driven Approach
Ghosh M., Sil P., Roy A., Fajriyah R., Mondal K.C.
Article, Journal of The Institution of Engineers (India): Series B, 2021, DOI Link
The COVID-19 pandemic turned a worldwide health crisis into a humanitarian crisis. Amid this global emergency, human civilization is under enormous strain, since no proper therapeutic method has been discovered yet. A wave of research effort has been put toward the invention of therapeutics and vaccines against COVID-19; meanwhile, the spread of this fatal virus has already infected millions of people and claimed many lives all over the world. Computational biology can attempt to understand the protein–protein interactions between viral and host proteins, so that potential viral–host protein interactions can be identified, which is crucial information for drug discovery. In this study, an approach is presented for predicting novel interactions from maximal biclusters. Additionally, the predicted interactions are verified from biological perspectives through a study of gene ontology and KEGG pathways in relation to the newly predicted interactions.
A Double Threshold-Based Power-Aware Honey Bee Cloud Load Balancing Algorithm
Mondal A.S., Mukhopadhyay S., Mondal K.C., Chattopadhyay S.
Article, SN Computer Science, 2021, DOI Link
Present-day advancement in cloud computing provides ICT infrastructure as a service on a pay-per-use basis. As service demand increases, service providers organize large-scale data centers with a lot of resources, causing huge greenhouse gas emissions. These data centers' huge power demand necessitates the balancing of cloud load. To attain optimum resource utilization, the least CPU processing time, minimal average response time, and the avoidance of overload, cloud load balancing algorithms distribute workload across virtual machines. The key challenge is to develop a load balancing algorithm that consumes the least resources while fulfilling service demands. In this paper, a double-threshold-based power-aware honey bee load balancing algorithm is proposed for the fair and even distribution of incoming task requests to all virtual machines. The paper compares the proposed algorithm with five widely used existing load balancing algorithms, with performance analysis carried out using the popular CloudAnalyst simulation toolkit. Simulation results show that the proposed algorithm gives noteworthy outcomes for average response time, CPU cost, storage cost, memory cost, and energy consumption in cloud computing, demonstrating its resource utilization.
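The full honey-bee algorithm is not reproduced here; a toy sketch of the double-threshold idea, in which under-loaded VMs are preferred and VMs above the upper threshold are excluded, could look like the following (the thresholds, loads, and helper name are illustrative, not the paper's parameters):

```python
def pick_vm(loads, lower, upper):
    """Double-threshold placement sketch: prefer under-loaded VMs (< lower),
    fall back to moderately loaded ones (< upper), and reject the request
    if every VM is overloaded. Returns the chosen VM index, or None."""
    under = [i for i, load in enumerate(loads) if load < lower]
    if under:
        return min(under, key=lambda i: loads[i])
    moderate = [i for i, load in enumerate(loads) if load < upper]
    if moderate:
        return min(moderate, key=lambda i: loads[i])
    return None  # every VM above the upper threshold: defer the task

loads = [0.9, 0.55, 0.35, 0.7]  # current utilisation per VM (illustrative)
print(pick_vm(loads, lower=0.4, upper=0.8))  # picks index 2, the only VM below 0.4
```

In the paper's setting, the honey-bee foraging metaphor additionally weights candidate VMs by power awareness rather than choosing the minimum-load VM as this sketch does.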
Optimization of Coverage Hole Identification in 5G SON Using Data Mining
Nandy B.D., Mondal K.C.
Conference paper, Lecture Notes in Networks and Systems, 2021, DOI Link
View abstract ⏷
Recently, Self Organizing Networks (SONs) have been immensely researched; SON solutions must become the brains of the 5G network. 5G has been envisaged for deployment above 6 GHz, mostly at 28 GHz or higher. Beamforming will be used to mitigate coverage loss due to high propagation loss. Line-of-sight communication is a must for operation at 28 GHz and above, presenting severe challenges in maintaining UE connections. Optimizing coverage and capacity is a major requirement of operators. The main hindrance to achieving optimized coverage and capacity is the complexity of adjusting all the configuration parameters that affect them. Most existing solutions are based on 4G, and currently very few solutions have been proposed for 5G considering line-of-sight operation above 6 GHz. Here, we first identify the challenges that prevent the current SON paradigm from meeting the coverage requirements of 5G. We then propose a data mining-based approach to optimize performance and capacity for coverage hole identification in 5G SON networks.
Predicting novel interactions from HIV-1-human PPI data integrated with protein signatures and GO annotations
Pal D., Mondal K.C.
Article, International Journal of Bioinformatics Research and Applications, 2021, DOI Link
View abstract ⏷
Research on host–pathogen protein-protein interactions (PPIs), specifically HIV-1–human PPIs, has become one of the most challenging areas of medical science for antiviral drug invention. In this paper, we propose a pattern mining-based approach to predict novel interactions between HIV-1 and human proteins with an estimated confidence, based on experimentally validated known interactions integrated with protein signatures and gene ontology (GO) annotations (biological process, cellular component, and molecular function) of human proteins. The approach predicts more potential interactions along with the corresponding signatures and GO terms. We validate our predicted interactions by finding evidence in the literature and comparing with the predictions made by different computational approaches. We believe that our predicted PPI information enriches the PPI research field with greater knowledge and better understanding of the viral replication process, subsequently enhancing the discovery of new drug targets.
Fact-based expert system for supplier selection with ERP data
Mondal K.C., Nandy B.D., Baidya A.
Book chapter, Studies in Computational Intelligence, 2020, DOI Link
View abstract ⏷
For any business enterprise, supply chain management (SCM) plays an important role in an organization’s decision- and profit-making process. A very crucial step in SCM is supplier selection. It is such a pivotal step because it deploys a large amount of a firm’s financial resources. In return, the firms expect significant interest from contracting with suppliers offering higher value. Any discrepancy in this process can lead to low SCM performance which in turn may cause financial losses as well as bring about a decline in the firm’s market performance. This paper deals with the development of a strictly fact-based expert system for appropriate supplier selection and shows how rules can be broken down into atomic clauses.
Role of Machine Learning in ETL Automation
Mondal K.C., Biswas N., Saha S.
Conference paper, ACM International Conference Proceeding Series, 2020, DOI Link
View abstract ⏷
In the current business landscape, real-time analysis of enterprise data is crucial for an organization's decision-makers to take strategic resolutions and stay ahead of competitors. Often, data is outdated by the time it reaches the user. Organizations need reliable, up-to-the-minute information to make better proactive business decisions and improve process and organizational efficiency. Availability of information and business-critical reports in real time can be achieved through an automated ETL process. Typically, running an enterprise data warehouse requires coordination of many operations across many teams, including application and database teams. It also requires a lot of manual intervention, which is error-prone; executing all related steps in the correct sequence under accurate conditions can be a challenge. An automated ETL process helps address all these problems. Moreover, preprocessing is a crucial step in making data ready to load into a data warehouse for analysis, and machine learning-based preprocessing can be used to ensure data quality. In this paper, we address the availability and quality issues faced in traditional data warehouses. We explain how to automate the ETL process and how machine learning can be leveraged in it, so that the quality and availability of data are never compromised and data reaches the user on a near real-time basis.
Efficient incremental loading in ETL processing for real-time data integration
Biswas N., Sarkar A., Mondal K.C.
Article, Innovations in Systems and Software Engineering, 2020, DOI Link
View abstract ⏷
ETL (extract transform load) is the widely used standard process for creating and maintaining a data warehouse (DW). ETL is the most resource-, cost- and time-demanding process in DW implementation and maintenance. Nowadays, many graphical user interface (GUI)-based solutions are available to facilitate ETL processes. In spite of the high popularity of GUI-based tools, such an approach still has downsides. This paper focuses on an alternative ETL development approach based on hand coding. In some contexts, such as research and academic work, it is appropriate to go for a custom-coded solution, which can be cheaper, faster, and more maintainable than GUI-based tools. Some well-known code-based open-source ETL tools developed by the academic world have been studied in this article, and their architecture and implementation details are addressed. The aim of this paper is to present a comparative evaluation of these code-based ETL tools. Finally, an efficient ETL model is designed to meet present-day near real-time requirements.
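The core of near real-time incremental loading can be sketched as follows. This is a minimal illustration assuming each source row carries a monotonically increasing `updated_at` timestamp (a common watermark design, not necessarily the paper's exact mechanism): only rows changed since the last load are extracted, and the watermark advances with each run.

```python
# Minimal watermark-based incremental extraction sketch (assumed design,
# not the paper's specific ETL model).

def incremental_extract(source_rows, last_watermark):
    """Return only rows changed since the previous load, plus the new watermark."""
    delta = [r for r in source_rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in delta), default=last_watermark)
    return delta, new_watermark

rows = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 205},
    {"id": 3, "updated_at": 310},
]
# Previous run ended at watermark 200, so only ids 2 and 3 are re-loaded.
delta, wm = incremental_extract(rows, last_watermark=200)
print([r["id"] for r in delta], wm)  # [2, 3] 310
```

Persisting the returned watermark between runs is what makes repeated loads touch only the delta rather than the full source.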
A survey on cancer prediction and detection with data analysis
Nath A.S., Pal A., Mukhopadhyay S., Mondal K.C.
Review, Innovations in Systems and Software Engineering, 2020, DOI Link
View abstract ⏷
The World Health Organization reports cancer as a leading cause of mortality and morbidity worldwide. Accurate and early cancer risk assessment in average- to high-risk populations is vital in controlling cancer-related suffering and mortality. Advanced bioinformatics and data mining techniques, along with computer-aided cancer prediction and risk assessment, are used extensively to help identify high-risk populations as well as to assist individual cancer diagnosis and treatment. Early detection minimizes the risk of cancer spreading to secondary sites and ensures appropriate treatment at the onset of malignancy. The scope of our survey was to review over 90 publications on data analysis in cancer prediction and detection. The motivation was to accumulate and categorize knowledge on the usage of data analytics for cancer prediction and detection, and the aim was to compare a few of the major analytical approaches in cancer data analysis and highlight their effectiveness.
A comprehensive survey on cloud data mining (CDM) frameworks and algorithms
Barua H.B., Mondal K.C.
Article, ACM Computing Surveys, 2020, DOI Link
View abstract ⏷
Data mining is used for finding meaningful information in a vast expanse of data. With the advent of the Big Data concept, data mining has come to much more prominence. Discovering knowledge from a gigantic volume of data efficiently is a major concern as resources are limited, and cloud computing plays a major role in such a situation. Cloud data mining fuses the applicability of classical data mining with the promises of cloud computing, allowing knowledge discovery from huge volumes of data to be performed efficiently. This article presents the existing frameworks, services, platforms, and algorithms for cloud data mining. The frameworks and platforms are compared based on similarity, data mining task support, parallelism, distribution, streaming data processing support, fault tolerance, security, memory types, storage systems, and other criteria. Similarly, the algorithms are grouped on the basis of parallelism type, scalability, streaming data mining support, and types of data managed. We also provide taxonomies based on data mining techniques such as clustering, classification, and association rule mining, and discuss the major applications of cloud data mining. This article aims at gaining better insight into the present research realm and directing future research toward efficient cloud data mining in future cloud systems.
Approximate Computing: A Survey of Recent Trends—Bringing Greenness to Computing and Communication
Barua H.B., Mondal K.C.
Review, Journal of The Institution of Engineers (India): Series B, 2019, DOI Link
View abstract ⏷
Energy-efficient computing is a much-needed technological advantage for the future. Approximate or inexact computing is a computing paradigm that can trade energy and computing time against output accuracy. Recent years have seen a great deal of research in industry as well as academia aimed at fruitfully realizing the dream of a greener and more energy-efficient computing era. This paper presents a comprehensive and concise survey of current research trends and contributions in energy-efficient computing from a computational point of view. Recent developments in approximate computing hardware, software, and approximate data communication are also discussed.
Dynamic FP Tree Based Rare Pattern Mining Using Multiple Item Supports Constraints
Biswas S., Mondal K.C.
Conference paper, Communications in Computer and Information Science, 2019, DOI Link
View abstract ⏷
Data mining is a fundamental ingredient for deriving association rules among a large variety of itemsets. Rare pattern mining is extremely useful for generating unknown, hidden, and unusual patterns from transactional datasets using predefined minimum support and confidence constraints. Rare association rules relate to rare items that represent useful knowledge, and mining rare patterns from such databases is often more interesting than frequent pattern mining. In this paper, we present a taxonomy of different support constraint models for rare pattern mining. We also perform a comprehensive literature review of existing tree-based rare pattern mining algorithms. Finally, we propose a multiple item support constraint-based dynamic rare pattern tree approach that generates only rare itemsets, without frequent itemset generation.
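The multiple-item-support idea can be illustrated as follows. In this hedged sketch (a brute-force illustration of the constraint model, not the paper's tree algorithm), each item carries its own minimum support MIS(i), and an itemset is treated as rare but useful when its support falls below the smallest MIS among its items yet stays above an absolute noise floor.

```python
# Brute-force illustration of rare itemset mining under multiple item
# supports (MIS). The per-item MIS values here are assumed for the example.

from itertools import combinations

def rare_itemsets(transactions, mis, floor):
    items = sorted({i for t in transactions for i in t})
    out = []
    for k in (1, 2):  # 1- and 2-itemsets, for brevity
        for cand in combinations(items, k):
            sup = sum(1 for t in transactions if set(cand) <= t)
            # Rare: below the tightest per-item threshold, above the noise floor.
            if floor <= sup < min(mis[i] for i in cand):
                out.append((cand, sup))
    return out

tx = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"a"}]
mis = {"a": 4, "b": 3, "c": 2}  # hypothetical per-item minimum supports
print(rare_itemsets(tx, mis, floor=1))
```

The noise floor separates genuinely rare patterns from spurious ones, which is the distinction a single uniform minimum support cannot express.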
Fault Analysis and Trend Prediction in Telecommunication Using Pattern Detection: Architecture, Case Study and Experimentation
Mondal K.C., Barua H.B.
Conference paper, Communications in Computer and Information Science, 2019, DOI Link
View abstract ⏷
In recent years, almost every industry, especially digital, e-commerce, and telecommunication, has experienced exponential growth of data. Harvesting knowledge from these highly dynamic databases and finding closed patterns to analyze modern trends has attracted considerable interest in this decade. Grouping similar data objects, or identifying high-density regions in a dataset to form clusters, is a key focus for researchers. Data mining and business analytics have become an integral part of the telecommunication industry for extracting usage patterns of new as well as profiled customers and retaining them in a competitive market. Providing better services without interruption is essential, as this gains the confidence of customers. In this paper, we state various challenges faced by the industry and propose a unified architecture for telecom data analytics. Case studies are put forward for customer usage pattern detection and network fault analysis, using clustering and bi-clustering techniques to group and segment customers for business predictions as well as to identify faulty nodes in the network and predict network failures.
Empirical Analysis of Programmable ETL Tools
Biswas N., Sarkar A., Mondal K.C.
Conference paper, Communications in Computer and Information Science, 2019, DOI Link
View abstract ⏷
ETL (Extract Transform Load) is the widely used standard process for creating and maintaining a Data Warehouse (DW). ETL is the most resource-, cost- and time-demanding process in DW implementation and maintenance. Nowadays, many Graphical User Interface (GUI)-based solutions are available to facilitate the ETL processes. In spite of the high popularity of GUI-based tools, such an approach still has downsides. This paper focuses on an alternative ETL development approach based on hand coding. In some contexts, it is appropriate to custom develop ETL code, which can be cheaper, faster, and more maintainable. Some well-known code-based open-source ETL tools (Pygrametl, Petl, Scriptella, R_etl) developed by the academic world have been studied in this article, and their architecture and implementation details are addressed. The aim of this paper is to present a comparative evaluation of these code-based ETL tools, not to acclaim that code-based ETL is superior to the GUI-based approach; choosing between the code-based and GUI-based paths depends on the particular requirements, data strategy, and infrastructure of the organization.
Performance analysis of structured, un-structured, and cloud storage systems
Mondal A.S., Sanyal M., Chattapadhyay S., Mondal K.C.
Article, International Journal of Ambient Computing and Intelligence, 2019, DOI Link
View abstract ⏷
Big Data management is an interesting research challenge for all storage vendors. Since data can be structured or unstructured, a variety of storage systems have been designed to meet storage requirements according to organizations' demands. The article focuses on different kinds of storage systems, their architecture, and their implementations. The first portion of the article describes examples of structured (PostgreSQL) and unstructured databases (MongoDB, OrientDB, and Neo4j), along with their data models and a comparative performance analysis between them. The second portion focuses on cloud storage systems; as an example, Google Cloud Storage and mainly its implementation details are discussed. The aim of the article is not to eulogize any particular storage system, but to clearly point out that every storage system has a role to play in the industry. It depends on the enterprise to identify the requirements and deploy the storage systems.
A new approach for conceptual extraction-transformation-loading process modeling
Biswas N., Chattapadhyay S., Mahapatra G., Chatterjee S., Mondal K.C.
Article, International Journal of Ambient Computing and Intelligence, 2019, DOI Link
View abstract ⏷
Erroneous or incomplete data generated from various sources can have a direct impact on business analysis. Data extracted from sources needs to be loaded into the data warehouse after required transformation to reduce errors and minimize data loss. This process is known as Extraction-Transformation-Loading (ETL). A high-level view of system activities can be visualized by conceptual modeling of the ETL process, providing the advantages of early identification of system errors, cost minimization, and scope and risk assessment. A new modeling approach is proposed for conceptualizing the ETL process using the standard Systems Modeling Language (SysML). For handling the increasing complexity of any system model, it is preferable to go through verification and validation early in system development. In this article, the authors' previous work is extended by presenting an MBSE-based approach to automate the validation of the SysML model using the No Magic simulator. The main objective is to bridge the gap between modeling and simulation and to examine the performance of the proposed SysML model. The usefulness of the authors' approach is exhibited using a use case scenario.
A hybrid intrusion detection system for hierarchical filtration of anomalies
Kar P., Banerjee S., Mondal K.C., Mahapatra G., Chattopadhyay S.
Conference paper, Smart Innovation, Systems and Technologies, 2019, DOI Link
View abstract ⏷
A Network Intrusion Detection System (NIDS) deals with the analysis of network traffic to reveal malicious activities and network attacks. The diversity of approaches to NIDS, however, is matched by the drawbacks associated with those techniques. In this paper, an NIDS is proposed that aims at hierarchical filtration of intrusions. Experimental analysis has been performed using KDD Cup'99 and NSL-KDD, from which it can be clearly inferred that the proposed technique detects attacks with high accuracy rates, high detection rates, and low false alarms. The run-time analysis of the proposed algorithm shows the feasibility of its usage and its improvement over existing algorithms.
Design and Implementation of an Improved Data Warehouse on Clinical Data
Garain N., Chattopadhyay S., Mahapatra G., Chatterjee S., Mondal K.C.
Conference paper, Communications in Computer and Information Science, 2019, DOI Link
View abstract ⏷
A data warehouse is a repository that stores huge volumes of detailed and summary data for historical data analysis in a decision support system, drawing data from remote, complex, and heterogeneous operational data sources. A clinical data warehouse contains complex, heterogeneous data from different data sources. In the literature, different data warehouse architectures exist, each with its own design issues relevant to different application areas. In this paper, we propose a conceptual and logical view of data warehouse architecture along with a physical implementation of the data warehouse. Our main focus is to efficiently handle the complex heterogeneous medical data stored in the warehouse and improve the performance of the data warehouse for data analysis. We propose a partitioning concept for the dimension and fact tables to optimize response time, minimize disk I/O, and reduce the joining cost of the data warehouse. To show the effectiveness of our system, we compare different joining techniques for the dimension and fact tables of a fact-consolidated data warehouse schema. A mathematical cost model of disk I/O optimization is derived, and SQL window partitioning techniques are used for data analysis of the proposed data warehouse. After storing complex heterogeneous data in a well-organized and efficient way, efficient searching techniques need to be incorporated; here, a bitmap indexing technique is used for that purpose.
Parallel apriori based distributed association rule mining: A comprehensive survey
Biswas S., Biswas N., Mondal K.C.
Conference paper, Proceedings - 2018 4th IEEE International Conference on Research in Computational Intelligence and Communication Networks, ICRCICN 2018, 2018, DOI Link
View abstract ⏷
Association rule mining (ARM) has attracted the attention of both data mining users and database researchers in the last decade. Generating association rules from large distributed databases is a crucial task due to the intrinsic distribution of data sources. Mining such distributed data sources requires deep knowledge of data mining and careful planning for deployment in a distributed environment. In this paper, a survey of distributed frameworks for ARM is presented. It is observed that the parallelized nature of Apriori, Hadoop, MapReduce, and Spark proves to be very efficient in distributed association rule mining (DARM) environments. We expect that this comprehensive review and the cited references will convey the foremost theoretical issues and guide researchers toward interesting research directions.
A factual analysis of improved python implementation of apriori algorithm
Mondal K.C., Nandy B.D., Baidya A.
Book chapter, Methodologies and Application Issues of Contemporary Computing Framework, 2018, DOI Link
View abstract ⏷
Data mining, also known as Knowledge Discovery in Databases (KDD), includes the task of finding anomalies, correlations, patterns, and trends to predict outcomes [1, 2]. Association rule mining, along with classification and clustering, is one of the most prominent data mining tasks and has gained much importance in recent years across many application domains. In general, KDD is a sequence of processes stated [3] as follows: data cleaning, which includes the removal of noise and inconsistency from the data; data integration, where multiple data sources are combined and integrated into one; data selection, where data relevant to the analysis task are retrieved; data transformation, in which data is transformed into forms appropriate for mining by performing several aggregation operations; data mining, which applies intelligent methods to extract various patterns from data; pattern evaluation, where the extracted patterns are evaluated and the ones truly representing knowledge are identified; and knowledge representation, including techniques to represent the discovered knowledge.
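The candidate-generation and support-counting loop at the heart of Apriori can be sketched as follows. This is a minimal textbook version for illustration, not the chapter's improved implementation.

```python
# Minimal textbook Apriori sketch: grow frequent itemsets level by level,
# joining frequent (k-1)-itemsets into k-candidates and pruning by support.

def apriori(transactions, min_support):
    tx = [frozenset(t) for t in transactions]
    items = {i for t in tx for i in t}
    # L1: frequent 1-itemsets
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in tx) >= min_support}
    result, k = set(freq), 2
    while freq:
        # Join step: size-k candidates from frequent (k-1)-itemsets
        cands = {a | b for a in freq for b in freq if len(a | b) == k}
        # Prune step: keep only candidates meeting minimum support
        freq = {c for c in cands if sum(c <= t for t in tx) >= min_support}
        result |= freq
        k += 1
    return result

tx = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}]
found = apriori(tx, min_support=2)
print(frozenset({"bread", "milk"}) in found)  # True: appears in 2 baskets
```

Each pass rescans all transactions, which is exactly the deficiency that closed-itemset and tree-based approaches elsewhere in this list set out to avoid.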
A suffix tree based parallel approach for association rule mining and biclustering
Mondal K.C., Bhattacharya S., Mondal A.S.
Conference paper, 2016 International Conference on Computer, Electrical and Communication Engineering, ICCECE 2016, 2017, DOI Link
View abstract ⏷
Data mining is the process of analyzing raw data from very large databases to turn them into useful and previously unknown information, helping to find interesting patterns, trends, and relationships within data. Association rule mining and bi-clustering are two very important data mining tasks for many application domains, especially bioinformatics. FIST is one of the very few algorithms that extracts bases of association rules and biclusters conjointly in a single process. The FIST algorithm is based on the frequent closed itemsets framework and uses a suffix tree-based data structure for efficiency. However, due to its sequential execution, the traditional FIST algorithm suffers from efficiency problems in execution time for very large, high-dimensional datasets. Here, a parallelized version of the FIST algorithm (ParaFIST) is proposed to improve performance: a multi-threaded approach allows parallel processing of the suffix tree branches to achieve better execution time. We use an example to demonstrate the correctness of the proposed algorithm.
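The parallelization scheme the abstract describes can be sketched as follows, assuming (as stated) that top-level suffix-tree branches can be mined independently: each branch is handed to a worker thread and the partial results are merged. The branch representation and `mine_branch` body here are stand-ins for illustration, not the ParaFIST implementation.

```python
# Sketch of per-branch parallel mining: independent suffix-tree branches
# are processed concurrently by a thread pool, then results are merged.

from concurrent.futures import ThreadPoolExecutor

def mine_branch(branch):
    """Stand-in for per-branch closed-itemset extraction."""
    root, itemsets = branch
    return [(root,) + iset for iset in itemsets]

# Toy branches: (root item, itemsets found under that branch)
branches = [
    ("a", [("b",), ("b", "c")]),
    ("b", [("c",)]),
    ("c", []),
]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(mine_branch, branches))  # branches run in parallel

results = [p for part in partials for p in part]  # merge partial results
print(sorted(results))
```

Because the branches share no mutable state, no locking is needed and the merge step is a simple concatenation, which is what makes this decomposition attractive.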
SysML based conceptual ETL process modeling
Biswas N., Chattopadhyay S., Mahapatra G., Chatterjee S., Mondal K.C.
Conference paper, Communications in Computer and Information Science, 2017, DOI Link
View abstract ⏷
Data generated from various sources can be erroneous or incomplete, which can have a direct impact on business analysis. ETL (Extraction-Transformation-Loading) is a well-known process that extracts data from different sources, transforms those data into the required format, and finally loads them into the target data warehouse (DW). ETL plays an important role in the data warehouse environment, and configuring an ETL process is one of the key factors with a direct impact on the cost, time, and effort of establishing a successful data warehouse. Conceptual modeling of ETL can give a high-level view of system activities, providing the advantages of early identification of system errors, cost minimization, and scope and risk assessment. Some research has been done on modeling the ETL process at the conceptual level using UML, BPMN, and the Semantic Web. In this paper, we propose a new approach for conceptual modeling of the ETL process using the Systems Modeling Language (SysML), which extends UML features with much clearer semantics from a systems engineering point of view. We show the usefulness of our approach through a use case scenario.
Closure based integrated approach for associative classifier
Chowdhury S.B., Pal D., Sarkar A., Mondal K.C.
Conference paper, Advances in Intelligent Systems and Computing, 2017, DOI Link
View abstract ⏷
Building a classifier using association rules for the classification task is a supervised data mining technique called Associative Classification (AC). Experiments show that AC achieves a higher degree of classification accuracy than traditional approaches. The learning methodology used in most AC algorithms is Apriori-based; thus, these algorithms inherit some of Apriori's deficiencies, such as multiple scans of the dataset and an accumulative increase in the number of rules. A closed itemset-based approach is a solution to these drawbacks. Here, we propose a closed itemset-based associative classifier (ACFIST) to generate class association rules (CARs) along with biclusters. Because it is based on the closure concept, the approach generates a lossless and condensed set of rules. Experiments on benchmark datasets show the volume of results it generates.
Comparative analysis of structured and un-structured databases
Mondal A.S., Sanyal M., Chattopadhyay S., Mondal K.C.
Conference paper, Communications in Computer and Information Science, 2017, DOI Link
View abstract ⏷
The introduction of relational database systems enabled faster transactions compared to existing systems for handling structured data. However, in the course of time, the cost of storing huge volumes of unstructured data became an issue in traditional relational database systems. This is where unstructured database systems like NoSQL databases were introduced to store unstructured data. This paper focuses on one structured (PostgreSQL) and three unstructured database systems (MongoDB, OrientDB, and Neo4j). We examine the different data models they follow and analyze their comparative performance through experimental evidence.
Brief review on optimal suffix data structures
Mondal K.C., Paul A., Sarkar A.
Conference paper, Advances in Intelligent Systems and Computing, 2017, DOI Link
View abstract ⏷
The suffix tree is a fundamental data structure in the area of combinatorial pattern matching, with many elegant applications in almost all areas of data mining. It is an efficient data structure for finding solutions in these areas, but its large space requirement is its major disadvantage. Optimizing this data structure has been an active area of research ever since it was introduced, and presenting the major works on suffix tree optimization is the subject of this article. Optimization in terms of the space required to store the suffix tree, the time complexity of constructing it, or the time to perform operations such as searching on it has been a major attraction for researchers over the years. In this article, we present the different forms of this data structure and compare them. A comparative study is presented of algorithms yielding optimized versions of the suffix tree, in terms of the space and/or time required to construct the tree or to perform a search operation on it.
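One of the space-optimized alternatives such a survey covers is the suffix array: it stores only the sorted suffix start positions (O(n) integers) instead of an explicit tree, while still supporting substring search by binary search over the sorted suffixes. A naive O(n² log n) construction, for illustration only:

```python
# Suffix array sketch: sorted suffix start positions plus binary search,
# a space-optimized stand-in for suffix-tree substring queries.

def suffix_array(s):
    # Naive construction: sort suffix start positions by the suffix text.
    return sorted(range(len(s)), key=lambda i: s[i:])

def contains(s, sa, pattern):
    """Substring search via binary search over the sorted suffixes."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and s[sa[lo]:].startswith(pattern)

s = "banana"
sa = suffix_array(s)
print(sa)                      # [5, 3, 1, 0, 4, 2]
print(contains(s, sa, "nan"))  # True
```

Optimized constructions bring this down to O(n) time, which is the kind of space/time trade-off the surveyed algorithms compete on.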
Comparative study of parallelism on data mining
Mondal K.C., Bhattacharya S., Sarkar A.
Conference paper, Advances in Intelligent Systems and Computing, 2017, DOI Link
View abstract ⏷
Today's world has seen a massive explosion of various kinds of data with unique characteristics such as high dimensionality and heterogeneity. Automated, data-driven techniques have become a necessity for extracting useful information from these huge and diverse data sets. Data mining is an important step in the process of knowledge discovery in databases (KDD) and focuses on discovering hidden information in data that goes beyond simple analysis. Traditional data mining methods are often inefficient and unsuitable for analyzing today's data sets due to their heterogeneity, massive size, and high dimensionality. Thus, parallelization of traditional data mining algorithms has become almost inevitable, yet challenging given available hardware and software solutions. The main objective of this paper is to examine the need for, and limitations of, parallelizing data mining algorithms and to find ways to achieve the best results. In this comparative study, we look at different parallel computer architectures, well-proven parallelization methods, and the choice of programming language.
Knowledge discovery from HIV-1-human PPIs assimilating interaction keywords
Pal D., Mondal A.S., Mondal K.C.
Conference paper, 2016 International Conference on Computer, Electrical and Communication Engineering, ICCECE 2016, 2017, DOI Link
View abstract ⏷
Today's world is increasingly afflicted by Human Immunodeficiency Virus Type 1 due to its pervasive and deadly nature. The virus replicates by exploiting a complex interaction network of HIV-1 and human proteins and destroys human immunity, gradually leading to AIDS. Anti-HIV drugs are designed using information on viral–host protein-protein interactions (PPIs), so that viral replication and infection can be prevented. In this article, we present an effective computational approach using a pattern mining-based algorithm to predict novel interactions between HIV-1 and human proteins, based on experimentally validated interactions curated in a public PPI database. Additionally, we provide information on the interaction types associated with each predicted interaction, with an estimated confidence. Further, we analyze our predicted interactions by finding overlap with other studies. We believe this article will enhance the discovery of HIV-1 medications by informing the design of drug targets.
A complete review of computational methods for human and HIV-1 protein interaction prediction
Pal D., Mondal K.C.
Article, International Journal of Bioinformatics Research and Applications, 2016, DOI Link
View abstract ⏷
Human Immunodeficiency Virus Type 1 (HIV-1) has grabbed the attention of virologists in recent times owing to its life-threatening nature and epidemic spread throughout the globe. The virus exploits a complex interaction network of HIV-1 and human proteins for replication and destroys human immunity. Antiviral drugs are designed to utilise information on viral-host Protein-Protein Interactions (PPIs), so that viral replication and infection can be prevented. Therefore, the prediction of novel interactions based on experimentally validated interactions, curated in the public PPI database, could help in discovering new therapeutic targets. This article gives an overview of HIV-1 proteins and their role in virus replication, followed by a discussion of different types of antiretroviral drugs and the HIV-1–human PPI database. Thereafter, we present a brief explanation of the different computational approaches adopted to predict new HIV-1–human PPIs, along with a comparative study among them.
Galois closure based association rule mining from biological data
Mondal K.C., Pasquier N.
Book chapter, Biological Knowledge Discovery Handbook: Preprocessing, Mining and Postprocessing of Biological Data, 2014, DOI Link
This chapter presents the theoretical frameworks, condensed representations, interestingness measures, and biological applications of association rule mining, with an emphasis on genomics and proteomics. The different theoretical frameworks proposed for itemset representation and frequent-itemset extraction are described, along with the proposed solutions for reducing the set of extracted association rules to the most relevant and useful ones. This is an important topic in association rule mining, as several thousands, and sometimes millions, of association rules can be generated from large databases, often carrying much redundant information. The chapter then presents the different condensed representations of association rules, such as minimal covers, bases, and inference systems, together with their properties, as well as the subjective and objective interestingness measures that can be used for selecting the most relevant rules.
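As a concrete illustration of the support and confidence measures discussed in the chapter, the following minimal sketch mines association rules from a toy transaction database. The basket data and the thresholds are illustrative assumptions, not taken from the chapter.

```python
from itertools import combinations

# Toy market-basket data (illustrative only; the chapter is theoretical).
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

MIN_SUP, MIN_CONF = 0.5, 0.9  # assumed interestingness thresholds

items = sorted(set().union(*transactions))
rules = []
for size in range(2, len(items) + 1):
    for itemset in map(frozenset, combinations(items, size)):
        if support(itemset) < MIN_SUP:
            continue
        # Every non-empty proper subset is a candidate antecedent.
        for k in range(1, size):
            for ante in map(frozenset, combinations(sorted(itemset), k)):
                conf = support(itemset) / support(ante)
                if conf >= MIN_CONF:
                    rules.append((set(ante), set(itemset - ante), conf))
```

With these thresholds only the rule {butter} -> {bread} survives (confidence 1.0), which hints at why pruning and condensed representations matter: the raw rule space grows exponentially with the number of items.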
RCA as a data transforming method: A comparison with propositionalisation
Dolques X., Mondal K.C., Braud A., Huchard M., Le Ber F.
Conference paper, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2014, DOI Link
This paper aims at comparing transformation-based approaches built to deal with relational data, and in particular two approaches which have emerged in two different communities: Relational Concept Analysis (RCA), based on an iterative use of the classical Formal Concept Analysis (FCA) approach, and Propositionalisation, coming from the Inductive Logic Programming community. Both approaches work by transforming a complex problem into a simpler one, namely transforming a database consisting of several tables into a single table. For this purpose, a main table is chosen and new attributes capturing the information from the other tables are built and added to this table. We show the similarities between those transformations with respect to the principles underlying them, the semantics of the built attributes, and the result of a classification performed by FCA on the enriched table. This is illustrated on a simple dataset, and we also present a synthetic comparison based on a larger dataset from the hydrological domain. © 2014 Springer International Publishing.
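The single-table transformation that both approaches perform can be sketched as follows. The rivers/samples tables and the `has_*` existence attributes are illustrative assumptions (the paper's hydrological dataset is far richer), but they show the core idea: a boolean attribute is added to the main table for each value observed in the related table.

```python
# Main table and a one-to-many secondary table (illustrative data).
rivers = [{"id": "r1"}, {"id": "r2"}]
samples = [  # each river has many biological samples
    {"river": "r1", "taxon": "mayfly"},
    {"river": "r1", "taxon": "caddisfly"},
    {"river": "r2", "taxon": "mayfly"},
]

# Build one boolean existence attribute per value seen in the
# secondary table: "does some related sample have taxon == t?"
taxa = sorted({s["taxon"] for s in samples})
flat = []
for r in rivers:
    row = dict(r)
    related = [s for s in samples if s["river"] == r["id"]]
    for t in taxa:
        row[f"has_{t}"] = any(s["taxon"] == t for s in related)
    flat.append(row)
```

The enriched single table `flat` can then be handed to FCA (or any propositional learner), which is the point of convergence the paper examines.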
A new approach for association rule mining and bi-clustering using formal concept analysis
Mondal K.C., Pasquier N., Mukhopadhyay A., Maulik U., Bandyopadhyay S.
Conference paper, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2012, DOI Link
Association rule mining and bi-clustering are data mining tasks that have become very popular in many application domains, particularly in bioinformatics. However, to our knowledge, no previous algorithm performs these two tasks in a single process. We propose a new approach called FIST for jointly extracting bases of extended association rules and conceptual bi-clusters. This approach is based on the frequent closed itemsets framework and requires a single scan of the database. It uses a new suffix-tree-based data structure to reduce memory usage and improve extraction efficiency, allowing parallel processing of the tree branches. Experiments conducted to assess its applicability to very large datasets show that FIST memory requirements and execution times are in most cases equivalent to those of frequent closed itemsets based algorithms and lower than those of frequent itemsets based algorithms. © 2012 Springer-Verlag.
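A closed itemset paired with the list of objects supporting it already forms a conceptual bi-cluster, which is the correspondence FIST exploits to deliver both outputs in one pass. The following is a minimal Python sketch of that pairing; the toy database and function names are illustrative assumptions, not FIST's actual suffix-tree implementation.

```python
# Toy object-item database (illustrative; think genes x conditions).
db = {
    "o1": {"i1", "i2", "i3"},
    "o2": {"i1", "i2"},
    "o3": {"i2", "i3"},
}

def bicluster(itemset):
    """Return (extent, intent): the objects containing `itemset`,
    and the items shared by all of those objects (the Galois closure).
    Assumes `itemset` occurs in at least one object."""
    extent = {o for o, items in db.items() if itemset <= items}
    intent = set.intersection(*(db[o] for o in extent))
    return sorted(extent), sorted(intent)

# The conceptual bi-cluster generated by the seed itemset {"i1"}:
extent, intent = bicluster({"i1"})
```

Here the seed {"i1"} closes to the bi-cluster ({o1, o2}, {i1, i2}): every object containing i1 also contains i2, so the rule i1 -> i2 holds with confidence 1 over exactly that object list.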
Prediction of protein interactions on HIV-1-human PPI data using a novel closure-based integrated approach
Mondal K.C., Pasquier N., Mukhopadhyay A., Da Costa Pereira C., Maulik U., Tettamanzi A.G.B.
Conference paper, BIOINFORMATICS 2012 - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms, 2012
Discovering Protein-Protein Interactions (PPI) is an interesting new challenge in computational biology. Identifying interactions among proteins has been shown to be useful for finding new drugs and preventing several kinds of diseases. The identification of interactions between HIV-1 proteins and human proteins is a particular PPI problem whose study might lead to the discovery of drugs and important interactions responsible for AIDS. We present the FIST algorithm for extracting hierarchical bi-clusters and minimal covers of association rules in one process. This algorithm is based on the frequent closed itemsets framework to efficiently generate a hierarchy of conceptual clusters and non-redundant sets of association rules with supporting object lists. Experiments conducted on an HIV-1 and human protein interaction dataset show that the approach efficiently identifies interactions previously predicted in the literature and can be used to predict new interactions based on previous biological knowledge.
MOSCFRA: A multi-objective genetic approach for simultaneous clustering and gene ranking
Mondal K.C., Mukhopadhyay A., Maulik U., Bandyopadhyay S., Pasquier N.
Conference paper, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2011, DOI Link
Microarray experiments generate a large amount of data, which is used to discover the genetic background of diseases and to characterise genes. Clustering tissue samples according to their co-expression behaviour and characteristics is an important tool for partitioning the dataset. Finding the clusters of a given dataset is a difficult task, and it becomes even more difficult when we simultaneously try to rank each gene (known as gene ranking) according to its ability to distinguish different classes of samples. In the literature, many algorithms are available for sample clustering and for gene ranking or selection separately, and a few algorithms exist for simultaneous clustering and feature selection. In this article, we propose a new approach for clustering the samples and ranking the genes simultaneously. A novel encoding technique for the chromosomes is proposed for this purpose, and the work is accomplished using a multi-objective evolutionary technique. Results are demonstrated for both artificial and real-life gene expression data sets. © 2011 Springer-Verlag Berlin Heidelberg.
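To make the idea of a joint encoding concrete, here is one plausible flat chromosome layout for simultaneous clustering and gene ranking. This is a hypothetical sketch, not the paper's actual novel encoding: the first block of genes holds a cluster label per sample, and the second block holds a real-valued weight per gene from which the ranking is decoded.

```python
import random

random.seed(0)
n_samples, n_genes, k = 6, 4, 2  # assumed toy problem sizes

# Hypothetical encoding: n_samples cluster labels followed by
# n_genes gene weights, all in one flat chromosome.
chromosome = [random.randrange(k) for _ in range(n_samples)] + \
             [random.random() for _ in range(n_genes)]

labels = chromosome[:n_samples]       # sample -> cluster assignment
weights = chromosome[n_samples:]      # per-gene discriminative weight

# Decode the gene ranking: genes sorted by descending weight.
ranking = sorted(range(n_genes), key=lambda g: -weights[g])
```

In a multi-objective evolutionary setting, one objective would score the clustering (e.g. cluster compactness under `labels`) and another the discriminative power of the top-weighted genes, with crossover and mutation acting on the whole chromosome at once.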