Surani Matharaarachchi

PhD in Statistics, University of Manitoba

I recently completed my PhD in the Department of Statistics at the University of Manitoba, with guidance from Dr. Saman Muthukumarana and Dr. Mike Domaratzki. My academic path has been grounded in the fields of statistics and machine learning, enriched by an MSc in Statistics from the University of Manitoba and a BSc in Statistics from the University of Sri Jayewardenepura, Sri Lanka.

My research focused on creating innovative methods to tackle class imbalance in classification tasks, enhancing model accuracy and interpretability. With over seven years of combined industry and academic research experience, I’ve led the design and implementation of machine learning solutions across diverse domains, from public health to education.

Currently, I work as a Data Scientist at the Government of Manitoba, where I apply my skills to support data-driven decision-making and policy formulation, contributing to impactful projects in the public sector.

Publications

Peer-Reviewed Publications

Enhancing SMOTE for Imbalanced Data with Abnormal Minority Instances

Matharaarachchi S., Domaratzki M., Muthukumarana S. (2024). “Enhancing SMOTE for Imbalanced Data with Abnormal Minority Instances.” Machine Learning with Applications.

Assessing feature selection method performance with class imbalance data

Matharaarachchi, S., M. Domaratzki, and S. Muthukumarana (2021). “Assessing feature selection method performance with class imbalance data.” Machine Learning with Applications. This paper was awarded a badge under the Reproducibility Badge Initiative (RBI).

Discovering long COVID symptom patterns: association rule mining and sentiment analysis in social media tweets

Matharaarachchi S., Domaratzki M., Katz A., Muthukumarana S. (2022). “Discovering Long COVID Symptom Patterns: Association Rule Mining and Sentiment Analysis in Social Media Tweets.” JMIR Form Res.

Modeling and feature assessment of the sleep quality among chronic kidney disease patients

Matharaarachchi S., Domaratzki M., Marasinghe C., Muthukumarana S., and Tennakoon V. (2022). “Modeling and Feature Assessment of the Sleep Quality among Chronic Kidney Disease Patients.” Sleep Epidemiology.

Minimizing features while maintaining performance in data classification problems

Matharaarachchi S., Domaratzki M., Muthukumarana S. (2022). “Minimizing features while maintaining performance in data classification problems.” PeerJ Computer Science 8:e1081.

A population data-driven approach to identifying ‘Long COVID’ cases in support of diagnosis and treatment.

Enns, J., Katz, A., Yogendran, M., Urquia, M., Muthukumarana S., Matharaarachchi, S., Singer, A., Nickel, N., Star, L., Cavett, T., Keynan, Y., Lix, L. and Sanchez-Ramirez, D. (2022) “A population data-driven approach to identifying ‘Long COVID’ cases in support of diagnosis and treatment.” International Journal of Population Data Science, 7(3).

Manuscripts Under Review

Long COVID Prediction in Manitoba Using Clinical Notes Data: A Machine Learning Approach

Matharaarachchi S., M. Domaratzki, A. Katz, S. Muthukumarana. (2024). “Long COVID Prediction in Manitoba Using Clinical Notes Data: A Machine Learning Approach.” Intelligence-Based Medicine.

Deep-ExtSMOTE: Integrating Autoencoders for Advanced Mitigation of Class Imbalance in High-Dimensional Data Classification.

Matharaarachchi S., M. Domaratzki, S. Muthukumarana. (2024). “Deep-ExtSMOTE: Integrating Autoencoders for Advanced Mitigation of Class Imbalance in High-Dimensional Data Classification.” Journal of Data Science.

Identifying people with post-COVID condition using linked, population-based administrative health data from Manitoba, Canada: Prevalence and predictors in the COVID-positive population.

Katz A., O. Ekuma, J. E. Enns, T. Cavett, A. Singer, D. C. Sanchez-Ramirez, Y. Keynan, L. M. Lix, R. Walld, M. S. Yogendran, N. Nickel, M. L. Urquia, L. Star, K. Olafson, S. Logsetty, R. Spiwak, J. Waruk, S. Matharaarachchi. (2024). “Identifying people with post-COVID condition using linked, population-based administrative health data from Manitoba, Canada: Prevalence and predictors in the COVID-positive population.” BMJ Open.

Education

University of Manitoba

Doctor of Philosophy
Statistics

GPA: 4.13/4.5

Thesis: New Developments for Addressing Class Imbalance Issue in Classification Tasks.

September 2021 - November 2024

University of Sri Jayewardenepura

Bachelor of Science (First Class) - Honors
Statistics (Special)

GPA: 3.8/4.0

First two years included coursework in Mathematics, Computer Science, and Statistics.

Dissertation: Study on Parliamentary General Electoral Systems in Sri Lanka.

November 2011 - December 2015

Teaching

As a Sessional Instructor

STAT 1150 - Introduction to Basic Statistics and Computing

Summer 2022



As a Teaching Assistant

Year | Course | Term
2022 | STAT 4150 - Bayesian Analysis and Computing | Fall 2022
2022 | STAT 7270 - Bayesian Inference | Fall 2022
2022 | STAT 2000 - Basic Statistical Analysis 2 (n=2) | Winter 2022
2021 | STAT 4150 - Bayesian Analysis and Computing | Fall 2021
2021 | STAT 7270 - Bayesian Inference | Fall 2021
2021 | STAT 2000 - Basic Statistical Analysis 2 (n=2) | Fall 2021
2021 | STAT 2000 - Basic Statistical Analysis 2 | Summer 2021
2021 | STAT 2000 - Basic Statistical Analysis 2 (n=2) | Winter 2021
2020 | STAT 2000 - Basic Statistical Analysis 2 (n=2) | Fall 2020
2020 | STAT 4150 - Bayesian Analysis and Computing | Winter 2020
2020 | STAT 7270 - Bayesian Inference | Winter 2020
2020 | STAT 1000 - Basic Statistical Analysis 1 (n=2) | Winter 2020
2019 | STAT 1000 - Basic Statistical Analysis 1 | Fall 2019

Note: n = Number of sections conducted for the same course.

Experience

July 2022 - Present
Data Scientist, Department of Education and Early Childhood Learning (February 2024 - Present)
  • Led the development of three public education data dashboards on the government website as part of a key government initiative, along with internal apps, using Power BI Desktop and Power BI Server to deliver streamlined, data-driven insights supporting decision-making in the education sector.
  • Extracted, cleaned, and processed data from the Education Information System (EIS), an integrated database on Oracle and MSSQL, using SQL for ETL tasks and R and Python for advanced analysis and reporting.
  • Conducted data analyses and fulfilled data requests using R, delivering customized insights to stakeholders and generating comprehensive provincial test reports to support policy review and data-driven decision-making in education.
Leader in Training (LTP), Data Science Stream (December 2022 - February 2024)
  • Collaborated with multiple departments and agencies to support diverse projects, sharing expertise and aligning data-driven solutions with specific departmental needs. Served as an R support partner in the Data Science Practicum Program, contributing technical skills to enhance project outcomes.
  • Engaged in a project to develop predictive models for property assessments using XGBoost, performing data cleaning, pre-processing, feature engineering, hyper-parameter tuning, and model fitting to achieve over 90% accuracy, ultimately reducing manual workload by approximately 75%.
  • Evaluated the impact of COVID-19 on high school education outcomes, using regression models to quantify effects and identify the most vulnerable student groups through analysis of performance trends and demographic factors.
  • Investigated the effects of heat waves on health-related illnesses using data from MCHP and Environment Canada, applying logistic regression models to identify patterns and correlations between extreme heat events and illness. These findings support public health responses and preventative measures for vulnerable populations.
STEP Student, Data Science Program (July 2022 - December 2022)
  • Processed, cleansed, and verified the integrity of billing address data using Python NLP tools and data from Python REST services with the NRCAN API to extract geo-location information, reducing the need for manual address verification by 60%.
September 2019 - November 2024
Research Assistant
  • Proposed advanced feature selection methods that outperform Recursive Feature Elimination (RFE) with SMOTE, achieving superior F1-scores in high-dimensional classification.
  • Applied association rule mining to detect Long COVID symptom patterns in over 1M social media records using Python’s NLP toolkit NLTK and LLMs, uncovering key insights into symptom co-occurrence and progression.
  • Identified Long COVID patients in Manitoba by extracting symptoms from health data using NLP, machine learning, and resampling, achieving 0.95 sensitivity, 0.81 specificity, and a 0.94 AUC. This approach revealed 24.7% of cases as Long COVID—14 times the previously known cases. Collaborated with MCHP using their secure RAS platform for data analysis.
  • Developed 4 advanced SMOTE algorithms with self-developed code, using proximity-based, probabilistic, and Bayesian weighting mechanisms to address abnormal instances in minority classes. Applied in both simulated and real-world data, these methods improved representational accuracy for minority classes in highly imbalanced datasets.
  • Achieved a 25% improvement in average F1 Score with the new Deep-ExtSMOTE method by integrating autoencoders, Bayesian resampling, and tailored embeddings for continuous and categorical data in TensorFlow and Keras. This approach advanced class imbalance handling in high-dimensional data; all computations were run on the Digital Research Alliance of Canada cluster.
  • Demonstrated strong independent research skills and effective time management by balancing complex algorithm development, data analysis, and publication efforts alongside full-time professional responsibilities, ensuring consistent progress and timely completion of all project milestones.
November 2020 - February 2021
Data Analytics Graduate Student Intern - Data Science
  • Assessed the spread of COVID-19 by predicting infections, recoveries, and deaths using time series predictive and SEIR models, processing and cleansing data from publicly available sources, and integrating Python REST services to streamline data analysis and reporting.

Self-Employed

Freelance Data Science Recruitment Consultant
December 2023
  • Provided freelance consulting services to Callia Inc. in Winnipeg, Canada, using expertise to offer feedback on the technical evaluation of data scientist candidates, conducting comprehensive test reviews, and collaborating with hiring managers for informed hiring decisions.
Data Scientist
  • Led data integration for the successful implementation of a dashboard, an analytics tool, and a mobile app on the Druid OLAP database as part of a new initiative. Achieved a 60% reduction in data processing time, optimized the database architecture, developed data models, and ensured data integrity at every stage.
  • Provided structured layouts and design specifications for each tool to the designer and collaborated closely with software engineers and architects to ensure seamless integration and alignment with project goals.
  • Extracted data from MongoDB and Elasticsearch databases to perform daily reporting, using Kafka and Flink to streamline data processing workflows.
  • Led project coordination with client stakeholders and cross-functional teams, facilitating clear communication, aligning project goals, managing timelines, and ensuring deliverables met client expectations and quality standards.
September 2017 - September 2018
Data & Report Analyst
  • Led the backend development of interactive regression and time series widgets within a new dashboard, using Python to automate models for predicting future sales and trends. Integrated data from multiple sources to provide robust analytical insights and improve forecasting accuracy by 30%.
  • Implemented data warehouse and ETL processes with Pygrametl, managing and querying databases such as PostgreSQL, MySQL, Microsoft SQL Server, and BigQuery to enable efficient data storage, retrieval, and analysis.
  • Conducted unit testing and actively participated in the Quality Assurance process, identifying and resolving defects, collaborating with QA teams to improve test coverage, and ensuring the reliability and performance of application features before deployment.
  • Worked in an agile environment, collaborating closely with backend and frontend engineers, QA specialists, and project managers. Engaged in sprint planning, daily stand-ups, and retrospectives to ensure alignment, address challenges, and deliver high-quality features on schedule.

Presentations

2024

Uncovering Symptoms and Predicting Long COVID Using Social Media Tweets and Clinical Notes Data: A Machine Learning Approach (Invited).

International Statistics Conference 2024 (ISC2024), Colombo, Sri Lanka.

Deep-ExtSMOTE: Integrating Autoencoders for Advanced Mitigation of Class Imbalance in High-Dimensional and Big Data Classification (Invited).

International Statistics Conference 2024 (ISC2024), Colombo, Sri Lanka.

New Developments for Addressing Class Imbalance Issue in Classification Tasks.

PhD Thesis Defense, Department of Statistics, Faculty of Graduate Studies, University of Manitoba.


Three Minute Thesis (3MT®).

Faculty of Graduate Studies, University of Manitoba.


Machine Learning-based Identification of Long COVID Syndrome: Leveraging Encounter Notes Symptoms (Invited).

4th International Conference on Future of Preventive Medicine & Public Health (Future of PMPH 2024).


Novel Approaches to Mitigate Abnormal Instances in Imbalanced Datasets - for Improved Classification Performance.

2024 WNAR/IMS/Graybill Annual Meeting, Fort Collins, Colorado - Student Paper Competition presentation.


2023

Long COVID Prediction in Manitoba Using Clinical Notes Data: A Machine Learning Approach.

CANSSI Showcase 2023


Data to Action Day 2023, organized by the Data Science Program, Government of Manitoba.


2022

Discovering long COVID symptom patterns: Association rule mining in social media tweets.

Statistical Society of Canada (SSC) Annual Meeting 2022.


2021

Modeling and Inference with Feature Importance for Assessing the Quality of Sleep among Chronic Kidney Disease Patients.

Joint Statistical Meetings (JSM) 2021.


Assessing Feature Selection Methods and Their Performance in High Dimensional Classification Problems.

MSc Thesis Defense, Department of Statistics, Faculty of Graduate Studies, University of Manitoba.


Assessing Feature Selection Methods and their Performance in High-Dimensional Classification Problems.

Statistical Society of Canada (SSC) Annual Meeting 2021.

Research Interests

Machine Learning, Large Language Models (LLMs), NLP, Knowledge Representation, Model Optimization, Statistical Learning, Deep Learning Techniques, Data Imbalance, Bayesian Methods.

My research focuses on machine learning, NLP, and statistical learning, with a special interest in LLMs, knowledge representation, and model optimization. I work on methods for high-dimensional data and data imbalance, developing Bayesian approaches and resampling techniques that enhance model accuracy in fields like healthcare and education. Additionally, I aim to bridge theory and practice, creating efficient, interpretable models that offer reliable, actionable insights for real-world applications.

Awards & Recognition

Scholarly & Professional Activities