Surani Matharaarachchi


Welcome! I am an Assistant Professor of Data Science at the New York Institute of Technology (NYIT), Vancouver campus. I hold an MSc and a PhD in Statistics from the University of Manitoba, Canada, where my research, guided by Prof. Saman Muthukumarana and Dr. Mike Domaratzki, focused on creating innovative methods to address class imbalance in classification tasks, enhancing both accuracy and interpretability.

With over nine years of combined academic, industry, and government experience, I have worked at the intersection of statistics, machine learning, and real-world applications, delivering impactful solutions in public health, education, and beyond. My passion lies in developing methods that are not only powerful but also interpretable and accessible, enabling data-driven decision-making for meaningful change.


Contact

You can connect with me using any of the platforms below.


Last updated: August 15, 2025

Publications

Peer-Reviewed Publications

Enhancing SMOTE for Imbalanced Data with Abnormal Minority Instances

Matharaarachchi S., Domaratzki M., Muthukumarana S. (2024). “Enhancing SMOTE for Imbalanced Data with Abnormal Minority Instances.” Machine Learning with Applications.

Assessing feature selection method performance with class imbalance data

Matharaarachchi, S., M. Domaratzki, and S. Muthukumarana (2021). “Assessing feature selection method performance with class imbalance data.” Machine Learning with Applications. This paper was awarded a reproducibility badge under the Reproducibility Badge Initiative (RBI).

Discovering long COVID symptom patterns: association rule mining and sentiment analysis in social media tweets

Matharaarachchi S., Domaratzki M., Katz A., Muthukumarana S. (2022). “Discovering Long COVID Symptom Patterns: Association Rule Mining and Sentiment Analysis in Social Media Tweets.” JMIR Form Res.

Modeling and feature assessment of the sleep quality among chronic kidney disease patients

Matharaarachchi S., Domaratzki M., Marasinghe C., Muthukumarana S., and Tennakoon V. (2022). “Modeling and Feature Assessment of the Sleep Quality among Chronic Kidney Disease Patients.” Sleep Epidemiology.

Minimizing features while maintaining performance in data classification problems

Matharaarachchi S., Domaratzki M., Muthukumarana S. (2022). “Minimizing features while maintaining performance in data classification problems.” PeerJ Computer Science 8:e1081.

A population data-driven approach to identifying ‘Long COVID’ cases in support of diagnosis and treatment.

Enns, J., Katz, A., Yogendran, M., Urquia, M., Muthukumarana S., Matharaarachchi, S., Singer, A., Nickel, N., Star, L., Cavett, T., Keynan, Y., Lix, L. and Sanchez-Ramirez, D. (2022) “A population data-driven approach to identifying ‘Long COVID’ cases in support of diagnosis and treatment.” International Journal of Population Data Science, 7(3).

Identifying people with post-COVID condition using linked, population-based administrative health data from Manitoba, Canada: prevalence and predictors in a cohort of COVID-positive individuals.

Katz, A., Ekuma, O., Enns, J., Cavett, T., Singer, A., Sanchez-Ramirez, D., Keynan, Y., Lix, L., Walld, R., Yogendran, M., Nickel, N., Urquia, M., Star, L., Olafson, K., Logsetty, S., Spiwak, R., Waruk, J., Matharaarachchi, S. (2025) “Identifying people with post-COVID condition using linked, population-based administrative health data from Manitoba, Canada: prevalence and predictors in a cohort of COVID-positive individuals.” BMJ Open, 15(1), e087920.

Manuscripts Under Review

Sequential Bayesian Estimation of the F1 Score Using the Dirichlet-Multinomial Model.

Matharaarachchi S., M. Turgeon, M. Domaratzki, S. Muthukumarana. (2025). “Sequential Bayesian Estimation of the F1 Score Using the Dirichlet-Multinomial Model.” International Journal of Data Science and Analytics.

Long COVID Prediction in Manitoba Using Clinical Notes Data: A Machine Learning Approach

Matharaarachchi S., M. Domaratzki, A. Katz, S. Muthukumarana. (2024). “Long COVID Prediction in Manitoba Using Clinical Notes Data: A Machine Learning Approach.”

Deep-ExtSMOTE: Integrating Autoencoders for Advanced Mitigation of Class Imbalance in High-Dimensional Data Classification.

Matharaarachchi S., M. Domaratzki, S. Muthukumarana. (2024). “Deep-ExtSMOTE: Integrating Autoencoders for Advanced Mitigation of Class Imbalance in High-Dimensional Data Classification.” Journal of Big Data Research.

Education

University of Manitoba

Doctor of Philosophy
Statistics

GPA: 4.13/4.5

Thesis: New Developments for Addressing Class Imbalance Issue in Classification Tasks.

September 2021 - November 2024

University of Sri Jayewardenepura

Bachelor of Science (First Class) - Honors
Statistics (Special)

GPA: 3.8/4.0

First two years included coursework in Mathematics, Computer Science, and Statistics.

Dissertation: Study on Parliamentary General Electoral Systems in Sri Lanka.

November 2011 - December 2015

Teaching

As a Full-time Faculty

DTSC 502 - Fundamental Probability and Statistics for Data Science

Fall 2025

DTSC 610 - Programming for Data Science

Fall 2025



As a Sessional Instructor

STAT 1150 - Introduction to Basic Statistics and Computing

Summer 2022



As a Teaching Assistant

Year Course Term
2022 STAT 4150 - Bayesian Analysis and Computing Fall 2022
2022 STAT 7270 - Bayesian Inference Fall 2022
2022 STAT 2000 - Basic Statistical Analysis 2 (n=2) Winter 2022
2021 STAT 4150 - Bayesian Analysis and Computing Fall 2021
2021 STAT 7270 - Bayesian Inference Fall 2021
2021 STAT 2000 - Basic Statistical Analysis 2 (n=2) Fall 2021
2021 STAT 2000 - Basic Statistical Analysis 2 Summer 2021
2021 STAT 2000 - Basic Statistical Analysis 2 (n=2) Winter 2021
2020 STAT 2000 - Basic Statistical Analysis 2 (n=2) Fall 2020
2020 STAT 4150 - Bayesian Analysis and Computing Winter 2020
2020 STAT 7270 - Bayesian Inference Winter 2020
2020 STAT 1000 - Basic Statistical Analysis 1 (n=2) Winter 2020
2019 STAT 1000 - Basic Statistical Analysis 1 Fall 2019

Note: n = Number of sections conducted for the same course.

Experience

Assistant Professor - Data Science
July 2022 - July 2025
Data Scientist, Department of Education and Early Childhood Learning (February 2024 - July 2025)
  • Led the development of three public education data dashboards on the government website as part of a key government initiative, along with internal apps, using Power BI Desktop and Power BI Server to deliver streamlined, data-driven insights supporting decision-making in the education sector.
  • Extracted, cleaned, and processed data from the Education Information System (EIS), an integrated database on Oracle and MSSQL, using SQL for ETL tasks and R and Python for advanced analysis and reporting.
  • Conducted data analyses and fulfilled data requests using R, delivering customized insights to stakeholders and generating comprehensive provincial test reports to support policy review and data-driven decision-making in education.
Leader in Training (LTP), Data Science Stream (December 2022 - February 2024)
  • Collaborated with multiple departments and agencies to support diverse projects, sharing expertise and aligning data-driven solutions with specific departmental needs. Served as an R support partner in the Data Science Practicum Program, contributing technical skills to enhance project outcomes.
  • Engaged in a project to develop predictive models for property assessments using XGBoost, performing data cleaning, pre-processing, feature engineering, hyper-parameter tuning, and model fitting to achieve over 90% accuracy, ultimately reducing manual workload by approximately 75%.
  • Evaluated the impact of COVID-19 on high school education outcomes, using regression models to quantify effects and identify the most vulnerable student groups through analysis of performance trends and demographic factors.
  • Investigated the effects of heat waves on health-related illnesses using data from MCHP and Environment Canada, fitting logistic regression models to identify patterns and associations linked to extreme heat events. These findings support public health responses and preventative measures for vulnerable populations.
STEP Student, Data Science Program (July 2022 - December 2022)
  • Processed, cleansed, and verified the integrity of billing address data using Python NLP tools and data from Python REST services with the NRCAN API to extract geo-location information, reducing the need for manual address verification by 60%.
September 2019 - Present
Postdoctoral Researcher (January 2025 - August 2025)
  • Developed a practical Bayesian framework for estimating the F1 score with uncertainty, using a Dirichlet-Multinomial model and sequential updates. Designed for streaming and online learning environments, it enables interpretable, real-time performance monitoring without retraining, supporting robust evaluation in imbalanced data scenarios.
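The idea in the bullet above can be illustrated with a minimal sketch. It is not the implementation from the manuscript; it only assumes that the four confusion-matrix cells (TP, FP, FN, TN) follow a multinomial distribution with a Dirichlet prior, so that each new batch of predictions updates the posterior by adding its cell counts. The class name `SequentialF1Estimator`, the uniform prior, and the Monte Carlo summary are all hypothetical choices made for illustration.

```python
import numpy as np


class SequentialF1Estimator:
    """Illustrative Dirichlet-Multinomial posterior over (TP, FP, FN, TN) cell probabilities."""

    def __init__(self, prior=(1.0, 1.0, 1.0, 1.0), seed=0):
        # Assumption: a uniform Dirichlet(1, 1, 1, 1) prior unless told otherwise.
        self.alpha = np.asarray(prior, dtype=float).copy()
        self.rng = np.random.default_rng(seed)

    def update(self, y_true, y_pred):
        """Conjugate update: add the confusion-matrix counts of a new batch."""
        y_true = np.asarray(y_true, dtype=bool)
        y_pred = np.asarray(y_pred, dtype=bool)
        counts = np.array(
            [
                np.sum(y_true & y_pred),    # true positives
                np.sum(~y_true & y_pred),   # false positives
                np.sum(y_true & ~y_pred),   # false negatives
                np.sum(~y_true & ~y_pred),  # true negatives
            ],
            dtype=float,
        )
        self.alpha += counts

    def f1_samples(self, n=10_000):
        """Draw cell probabilities from the posterior and map each draw to an F1 value."""
        p = self.rng.dirichlet(self.alpha, size=n)
        tp, fp, fn = p[:, 0], p[:, 1], p[:, 2]
        return 2 * tp / (2 * tp + fp + fn)

    def summary(self, n=10_000, level=0.95):
        """Posterior mean and equal-tailed credible interval for F1."""
        samples = self.f1_samples(n)
        tail = (1 - level) / 2
        lower, upper = np.quantile(samples, [tail, 1 - tail])
        return float(samples.mean()), float(lower), float(upper)


# Usage: feed batches as they arrive; the model under evaluation is never retrained.
estimator = SequentialF1Estimator()
estimator.update(y_true=[1, 1, 0, 0, 1], y_pred=[1, 0, 0, 1, 1])
estimator.update(y_true=[0, 1, 1, 0], y_pred=[0, 1, 0, 0])
mean_f1, lower, upper = estimator.summary()
print(f"F1 posterior mean {mean_f1:.3f}, 95% interval ({lower:.3f}, {upper:.3f})")
```

Because the Dirichlet prior is conjugate to the multinomial likelihood, each update is a simple addition of counts, which is what makes this kind of monitoring cheap in a streaming setting.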
Research Assistant (September 2019 - November 2024)
  • Proposed advanced feature selection methods that outperform Recursive Feature Elimination (RFE) with SMOTE, achieving superior F1-scores in high-dimensional classification.
  • Applied association rule mining to detect Long COVID symptom patterns in over 1M social media records using Python’s NLP toolkit NLTK, uncovering key insights into symptom co-occurrence and progression.
  • Identified Long COVID patients in Manitoba by extracting symptoms from health data using NLP, machine learning, and resampling, achieving 0.95 sensitivity, 0.81 specificity, and a 0.94 AUC. This approach identified 24.7% of cases as Long COVID, 14 times the number previously known. Collaborated with MCHP using their secure RAS platform for data analysis.
  • Developed 4 advanced SMOTE algorithms with self-developed code, using proximity-based, probabilistic, and Bayesian weighting mechanisms to address abnormal instances in minority classes. Applied in both simulated and real-world data, these methods improved representational accuracy for minority classes in highly imbalanced datasets.
  • Achieved a 25% improvement in average F1 score with the new Deep-ExtSMOTE by integrating autoencoders, Bayesian resampling, and tailored embeddings for continuous and categorical data in TensorFlow and Keras. This approach advanced class imbalance handling in high-dimensional data, with all computations performed on the Digital Research Alliance of Canada cluster.
  • Demonstrated strong independent research skills and effective time management by balancing complex algorithm development, data analysis, and publication efforts alongside full-time professional responsibilities, ensuring consistent progress and timely completion of all project milestones.
November 2020 - February 2021
Data Analytics Graduate Student Intern - Data Science
  • Assessed the spread of COVID-19 by predicting infections, recoveries, and deaths using time series predictive and SEIR models, processing and cleansing data from publicly available sources, and integrating Python REST services to streamline data analysis and reporting.

Self-Employed

Freelance Data Science Recruitment Consultant
December 2023
  • Provided freelance consulting services to Callia Inc. in Winnipeg, Canada, using expertise to offer feedback on the technical evaluation of data scientist candidates, conducting comprehensive test reviews, and collaborating with hiring managers for informed hiring decisions.
Data Scientist
  • Led data integration for a new initiative that delivered a dashboard, an analytics tool, and a mobile app on the OLAP Druid database, achieving a 60% reduction in data processing time while optimizing database architecture, developing data models, and ensuring data integrity at every stage.
  • Provided structured layouts and design specifications for each tool to the designer and collaborated closely with software engineers and architects to ensure seamless integration and alignment with project goals.
  • Extracted data from MongoDB and Elasticsearch databases to perform daily reporting, using Kafka and Flink to streamline data processing workflows.
  • Led project coordination with client stakeholders and cross-functional teams, facilitating clear communication, aligning project goals, managing timelines, and ensuring deliverables met client expectations and quality standards.
September 2017 - September 2018
Data & Report Analyst
  • Led the backend development of interactive regression and time series widgets within a new dashboard, using Python to automate models for predicting future sales and trends. Integrated data from multiple sources to provide robust analytical insights and improve forecasting accuracy by 30%.
  • Implemented data warehouse and ETL processes with Pygrametl, managing and querying databases such as PostgreSQL, MySQL, Microsoft SQL Server, and BigQuery to enable efficient data storage, retrieval, and analysis.
  • Conducted unit testing and actively participated in the Quality Assurance process, identifying and resolving defects, collaborating with QA teams to improve test coverage, and ensuring the reliability and performance of application features before deployment.
  • Worked in an agile environment, collaborating closely with backend and frontend engineers, QA specialists, and project managers. Engaged in sprint planning, daily stand-ups, and retrospectives to ensure alignment, address challenges, and deliver high-quality features on schedule.

Presentations

2025

Advanced Techniques for Mitigating Abnormal Instances and Class Imbalance in High-Dimensional Data Classification.

Statistical Society of Canada (SSC) Annual Meeting 2025


2024

Uncovering Symptoms and Predicting Long COVID Using Social Media Tweets and Clinical Notes Data: A Machine Learning Approach (Invited).

International Statistics Conference 2024 (ISC2024), Colombo, Sri Lanka.


Deep-ExtSMOTE: Integrating Autoencoders for Advanced Mitigation of Class Imbalance in High-Dimensional and Big Data Classification (Invited).

International Statistics Conference 2024 (ISC2024), Colombo, Sri Lanka.

New Developments for Addressing Class Imbalance Issue in Classification Tasks.

PhD Thesis Defense, Department of Statistics, Faculty of Graduate Studies, University of Manitoba.


Three Minute Thesis (3MT®).

Faculty of Graduate Studies, University of Manitoba.


Machine Learning-based Identification of Long COVID Syndrome: Leveraging Encounter Notes Symptoms (Invited).

4th International Conference on Future of Preventive Medicine & Public Health (Future of PMPH 2024).


Novel Approaches to Mitigate Abnormal Instances in Imbalanced Datasets - for Improved Classification Performance.

2024 WNAR/IMS/Graybill Annual Meeting, Fort Collins, Colorado - Student Paper Competition presentation.


2023

Long COVID Prediction in Manitoba Using Clinical Notes Data: A Machine Learning Approach.

CANSSI Showcase 2023


Data to Action Day 2023, organized by the Data Science Program, Government of Manitoba.


2022

Discovering long COVID symptom patterns: Association rule mining in social media tweets.

Statistical Society of Canada (SSC) Annual Meeting 2022.


2021

Modeling and Inference with Feature Importance for Assessing the Quality of Sleep among Chronic Kidney Disease Patients.

Joint Statistical Meetings (JSM) 2021.


Assessing Feature Selection Methods and Their Performance in High Dimensional Classification Problems.

MSc Thesis Defense, Department of Statistics, Faculty of Graduate Studies, University of Manitoba.


Assessing Feature Selection Methods and their Performance in High-Dimensional Classification Problems.

Statistical Society of Canada (SSC) Annual Meeting 2021.

Research Interests

Machine Learning, Statistical Learning, Classification, Algorithmic Approaches, Deep Learning Techniques, Bayesian Methods, High Dimensional Data Analysis, Computational Statistics

My research focuses on machine learning, NLP, and statistical learning, with a special interest in high-dimensional data analysis, feature engineering, class imbalance, LLMs, knowledge representation, and model optimization. I develop Bayesian approaches and resampling techniques for high-dimensional and imbalanced data that enhance model accuracy in fields like healthcare and education. I also aim to bridge theory and practice, creating efficient, interpretable models that offer reliable, actionable insights for real-world applications.

Awards & Recognition

Scholarly & Professional Activities

As a Volunteer Journal Peer Reviewer



Other Academic Contributions