Harmful algal bloom (HAB) occurrences are increasing globally, with detrimental impacts on lake ecosystems. A further concern is their unpredictable production of toxins that can harm animals and humans. Accurate prediction of HAB abundance is crucial for implementing timely mitigation and intervention strategies that protect both aquatic ecosystems and public health. Data-driven models are increasingly applied to forecast HAB events, but their performance depends heavily on data availability. This study investigates the potential impacts of data gaps on the accuracy and reliability of HAB prediction models for several lakes within the US. Three data imputation methods are examined for reconstructing missing values: Random Forest Regression, k-Nearest Neighbors, and Multiple Imputation by Chained Equations. A Random Forest machine learning model is employed to forecast cyanobacteria abundance. Similarity between lakes is quantified with Euclidean distance and Dynamic Time Warping, and agglomerative clustering is used to group the lakes based on these similarity metrics. Four predictive cases are tested, two of which are designed to evaluate whether using datasets from randomly selected lakes and datasets from similar lakes improves model performance and accuracy. The results show that similarity-based clustering of lakes can improve model performance. However, similarity-based clustering may not always accurately capture the underlying patterns and variances across all lakes. Experiences with machine learning models in this context will be shared at the presentation.
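The similarity analysis described above can be illustrated with a minimal sketch. It is not the study's implementation: the lake names and time series are hypothetical toy values, the Dynamic Time Warping routine is the textbook dynamic-programming form with an absolute-difference cost, and average linkage is an assumption, since the abstract does not state which linkage was used.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Textbook Dynamic Time Warping distance with absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] also matched b[j-1]'s predecessor path
                cost[i][j - 1],      # b[j-1] repeated against a[i-1]
                cost[i - 1][j - 1],  # one-to-one match
            )
    return cost[n][m]


def agglomerative(names, series, n_clusters, dist=dtw):
    """Average-linkage agglomerative clustering (linkage choice is an assumption).

    Returns a list of clusters, each a sorted list of names.
    """
    k = len(names)
    pairwise = [[dist(series[i], series[j]) for j in range(k)] for i in range(k)]
    clusters = [[i] for i in range(k)]

    def linkage(c1, c2):
        return sum(pairwise[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                value = linkage(clusters[p], clusters[q])
                if best is None or value < best[0]:
                    best = (value, p, q)
        _, p, q = best
        # Merge the closest pair of clusters and drop the absorbed one.
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(names[i] for i in c) for c in clusters]


# Hypothetical toy series standing in for per-lake cyanobacteria time series.
lake_names = ["lake_A", "lake_B", "lake_C", "lake_D"]
lake_series = [
    [0, 1, 2, 3, 2, 1, 0],  # lake_A
    [0, 0, 1, 2, 3, 2, 1],  # lake_B: lake_A delayed by one step
    [5, 5, 6, 5, 5, 6, 5],  # lake_C
    [5, 6, 5, 5, 6, 5, 5],  # lake_D
]

groups = agglomerative(lake_names, lake_series, n_clusters=2)
print(groups)
```

In this toy data, lake_B is a time-shifted copy of lake_A, so Dynamic Time Warping rates the pair as closer than Euclidean distance does. This is one reason the two metrics can produce different lake groupings.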