Coagulation is a complex physical and chemical process that removes solids from water by manipulating the electrostatic charges of suspended solids. It is an essential component of both drinking water and wastewater treatment, where it reduces turbidity, suspended solids, and organic loads. Coagulation directly affects the efficiency of downstream treatment processes such as flocculation, sedimentation, and filtration. A key issue in coagulation is adjusting the dosage of the coagulant and associated chemicals, such as coagulant aids, according to influent water quality. In current practice, the coagulant dosage is mainly determined by laboratory jar testing, which is a trial-and-error approach. However, jar testing cannot respond promptly to rapid changes in influent water quality caused by events such as storms or snowmelt, and continuous jar testing to track changing source water quality can be costly. Data analytical approaches, such as Machine Learning (ML) models, could be better candidates for developing a predictive model for coagulation and flocculation because they depend less on a clear understanding of the process mechanism. If sufficient high-quality data are available, these approaches may be used to find patterns in the data that indicate the correlation between operating conditions and influent/effluent water quality. The rapid advancement of ML technologies in the last decade has led to wide application of ML in environmental science and engineering, including source water, disinfection by-product formation, and coagulation/flocculation. However, the prediction accuracy of a trained ML model depends strongly on the similarity between the new data and the data used for training, and different types of ML models vary in their capability to handle new data of differing similarity.
This study conducts an in-depth investigation of the impact of new-data similarity on ML model prediction performance through a case study on coagulation data collected at a water treatment plant in Houston, Texas. A similarity index is introduced to quantify the similarity between the new data set and the training data set. Fifteen ML models are tested to find the model that best combines an accurate prediction for a high-similarity testing data set with an acceptable prediction for a low-similarity testing data set. The results and findings from this study could help build a proper level of confidence in applying ML to coagulation prediction.
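To make the idea of a similarity index concrete, the following is a minimal sketch of one way such a score could be computed. It is an illustrative assumption, not the index defined in this study: the function name `similarity_index`, the nearest-neighbor formulation, and the synthetic two-feature values are all hypothetical. The sketch standardizes each feature with the training statistics, measures the distance from every new sample to its nearest training sample, and maps the average closeness to a score between 0 and 1.

```python
import numpy as np


def similarity_index(train, new):
    """Hypothetical similarity score in (0, 1].

    NOT the index used in the study; an assumed nearest-neighbor
    formulation for illustration only. A score of 1 means every new
    sample coincides with some training sample.
    """
    train = np.asarray(train, dtype=float)
    new = np.asarray(new, dtype=float)

    # Standardize both sets with the training mean and standard deviation
    # so that features on different scales contribute comparably.
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    train_z = (train - mean) / std
    new_z = (new - mean) / std

    # Pairwise Euclidean distances, shape (n_new, n_train).
    diffs = new_z[:, None, :] - train_z[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Distance from each new sample to its nearest training sample,
    # converted to a closeness value and averaged.
    nearest = dists.min(axis=1)
    return float(np.mean(1.0 / (1.0 + nearest)))


# Synthetic values (assumed, not plant data): two generic features per sample.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
near = train + 0.1    # new data close to the training data
far = train + 100.0   # new data far from the training data

print(similarity_index(train, near) > similarity_index(train, far))
```

Under this assumed definition, a testing set drawn from conditions close to the training data scores higher than one drawn from unusual conditions, which is the kind of distinction between high-similarity and low-similarity testing sets described above.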