Identifying optimal synthesis conditions for metal-organic frameworks (MOFs) remains a significant challenge in materials science, often acting as a bottleneck in the discovery and development of new functional materials. Traditional trial-and-error approaches rely heavily on chemists’ intuition and prior experience, which are inefficient given the vast combinatorial space of possible synthesis parameters such as metal precursors, organic linkers, solvents, temperature, time, and composition. To address this, we developed an advanced text-mining pipeline to extract comprehensive synthesis information from 28,565 scientific papers in the Cambridge Structural Database MOF subset. Using a hybrid machine-learning and rule-based algorithm, we successfully extracted synthesis data for 46,701 distinct MOFs with an average F1 score of 90.3% across key parameters. This dataset enabled the training of a positive-unlabeled (PU) learning model to predict the synthesizability of MOFs based on input synthesis conditions. The model achieved a recall rate of 83.1% on test data, demonstrating strong predictive power. Notably, it correctly identified three experimentally reported amorphous MOFs with low synthesizability scores, while their crystalline counterparts consistently received high scores. These results highlight the potential of large-scale text mining to uncover hidden patterns in published literature, enabling rational prediction of viable synthesis pathways. By transforming unstructured textual data into actionable insights, our approach accelerates materials discovery by reducing reliance on empirical screening. This work represents a foundational step toward data-driven design of MOFs, paving the way for future models that can predict not only crystallinity but also morphology, activation protocols, and other critical material properties.
The success of this framework stems from a four-stage pipeline: paper parsing, synthesis paragraph classification, named entity recognition (NER), and condition extraction. First, full-text articles were retrieved in HTML, XML, or PDF formats from five major publishers under proper permissions. Next, a logistic regression model classified paragraphs containing synthesis information, achieving high precision (over 98%) and acceptable recall. A custom Bi-LSTM with a CRF layer was then trained for NER to identify and categorize chemicals (MOF names, metal precursors, organic linkers, and solvents) with high accuracy. For condition extraction, rule-based Python code applied unit detection, property classification, and distance-based matching to map numerical values to the relevant chemical entities. Preprocessing involved vectorizing precursor names using standardized chemical formulas and SMILES representations, normalizing composition into M/O ratios, and discretizing temperature in 10 °C intervals; reaction time was excluded because it is frequently omitted in publications.
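The condition-extraction step described above can be illustrated with a minimal sketch. It is not the authors' code: the unit table, the `Entity` class, the `locate` stand-in for the NER output, and the choice to floor temperatures to the lower edge of each 10 °C bin are all assumptions made for illustration. The sketch detects number-plus-unit pairs, classifies each by property, treats temperature and time as reaction-level conditions, and attaches amounts and volumes to the nearest chemical entity by character distance.

```python
import re
from dataclasses import dataclass

# Hypothetical unit table (not from the paper): unit string -> property type.
UNIT_TO_PROPERTY = {
    "°C": "temperature",
    "h": "time",
    "min": "time",
    "mmol": "amount",
    "mg": "amount",
    "mL": "volume",
}

# Longest units first, so a longer unit is never shadowed by a shorter one.
_UNITS = "|".join(
    re.escape(u) for u in sorted(UNIT_TO_PROPERTY, key=len, reverse=True)
)
# A number, optional whitespace, a known unit, and no letter right after it.
VALUE_RE = re.compile(rf"(\d+(?:\.\d+)?)\s*({_UNITS})(?![A-Za-z])")


@dataclass
class Entity:
    """A chemical mention with its label and character span (assumed NER output)."""
    text: str
    label: str
    start: int
    end: int


def locate(sentence, names_and_labels):
    """Stand-in for the NER step: find each known name by plain string search."""
    return [
        Entity(name, label, sentence.index(name), sentence.index(name) + len(name))
        for name, label in names_and_labels
    ]


def gap(start, end, entity):
    """Character distance between a value span and an entity span (0 if they overlap)."""
    if entity.end <= start:
        return start - entity.end
    if entity.start >= end:
        return entity.start - end
    return 0


def extract_conditions(sentence, entities):
    """Map each detected value to a reaction-level property or to the nearest entity."""
    result = {"temperature": [], "time": [], "per_entity": {}}
    for match in VALUE_RE.finditer(sentence):
        value, unit = float(match.group(1)), match.group(2)
        prop = UNIT_TO_PROPERTY[unit]
        if prop in ("temperature", "time"):
            result[prop].append((value, unit))
        elif entities:
            nearest = min(entities, key=lambda e: gap(match.start(), match.end(), e))
            result["per_entity"].setdefault(nearest.text, []).append((value, unit))
    return result


def temperature_bin(celsius, width=10):
    """Floor a temperature to the lower edge of its bin (binning rule is an assumption)."""
    return int(celsius // width) * width


if __name__ == "__main__":
    text = ("Zn(NO3)2·6H2O (0.5 mmol) and H2BDC (0.25 mmol) were dissolved in "
            "DMF (10 mL) and heated at 120 °C for 24 h.")
    found = locate(text, [("Zn(NO3)2·6H2O", "metal_precursor"),
                          ("H2BDC", "linker"), ("DMF", "solvent")])
    print(extract_conditions(text, found))
    print(temperature_bin(125))
```

Nearest-span matching is the simplest form of distance-based assignment; it works when quantities appear in parentheses right after the chemical they describe, and it would need extra rules for sentences that list several chemicals before their amounts.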
To handle the inherent imbalance in available data, where only successful syntheses are typically reported, we adopted PU learning, a technique designed for scenarios where negative examples are unavailable. Positive data consisted of real synthesis conditions; unlabeled data were randomly generated within the same parameter ranges. A neural network classifier was trained iteratively over 30 rounds, aggregating predictions to yield a final “crystal score” indicating likelihood of forming highly crystalline MOFs. Results showed that the model effectively distinguished between known crystalline and amorphous MOFs, with crystal scores above 0.5 corresponding to high-quality synthesis. Visualization revealed that amorphous MOFs fell outside regions of high predicted crystallinity, confirming the model’s ability to generalize from literature patterns. Despite limitations such as missing modulators, pH, stirring time, and activation steps, this study establishes a robust foundation for leveraging big data from scientific literature to guide MOF synthesis. The open-source text-mining tool and dataset are publicly available, fostering further innovation in AI-assisted materials design.
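The PU learning procedure described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the published model: a small logistic regression with squared features stands in for the neural network, inputs are assumed to be pre-scaled to [0, 1], the per-round predictions are aggregated by a plain mean, and the names `crystal_scores` and `toy_positives` are hypothetical. In each of 30 rounds, fresh unlabeled points are drawn uniformly from the parameter range and treated as provisional negatives, a classifier is trained against the fixed positives, and the query predictions are averaged.

```python
import math
import random


def features(x):
    """Rescale [0, 1] inputs to [-1, 1]; append squared terms and a bias."""
    u = [2.0 * v - 1.0 for v in x]
    return u + [v * v for v in u] + [1.0]


def sigmoid(z):
    """Logistic function with the input clamped to avoid overflow."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


def train_logistic(X, y, epochs=150, lr=1.0):
    """Full-batch gradient ascent on the logistic log-likelihood."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(dot(w, xi))
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w


def crystal_scores(positives, queries, rounds=30, seed=0):
    """Average, over several rounds, the predictions of classifiers trained
    on the positives versus freshly sampled unlabeled points."""
    rng = random.Random(seed)
    dim = len(positives[0])
    pos_feats = [features(p) for p in positives]
    query_feats = [features(q) for q in queries]
    totals = [0.0] * len(queries)
    for _ in range(rounds):
        # Unlabeled conditions drawn uniformly within the same parameter range.
        unlabeled = [[rng.random() for _ in range(dim)] for _ in positives]
        X = pos_feats + [features(u) for u in unlabeled]
        y = [1.0] * len(positives) + [0.0] * len(unlabeled)
        w = train_logistic(X, y)
        for i, q in enumerate(query_feats):
            totals[i] += sigmoid(dot(w, q))
    return [t / rounds for t in totals]


def toy_positives(n=40, seed=1):
    """Synthetic 'reported' conditions clustered near the centre of the range."""
    rng = random.Random(seed)
    clip = lambda v: min(1.0, max(0.0, v))
    return [[clip(rng.gauss(0.5, 0.08)), clip(rng.gauss(0.5, 0.08))]
            for _ in range(n)]


if __name__ == "__main__":
    inside, outside = crystal_scores(toy_positives(), [[0.5, 0.5], [0.95, 0.05]])
    print(f"near reported conditions: {inside:.2f}, far from them: {outside:.2f}")
```

Resampling the unlabeled set every round is what makes the averaged score less sensitive to any single draw of provisional negatives, some of which may in fact be viable synthesis conditions.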