Organic chemistry is nothing but pattern recognition!?
“This is nothing but pattern recognition!” As unsettling as this comment might sound to many organic chemistry practitioners, it was something one of my former supervisors said about (undergraduate) organic chemistry. Years later, this remark resurfaced in my mind as I delved into the world of machine learning.
I vividly remember when one of my brightest college classmates expressed a strong interest in neuroscience. Being more physics- and math-minded, I didn’t quite connect with this idea at the time, and it was not until recently that I realized the process of receiving information and forming memories or decisions could be modeled mathematically as communications between networks of neurons. (Thanks to Fei-Fei Li’s inspiring memoir The Worlds I See; highly recommended!)
The basic unit (algorithm) in these neural networks, aptly named a perceptron, takes in input (electrical) signals and performs binary classifications. It mimics what an actual neuron does: receiving electrical signals, accumulating and modulating them, and firing an output signal only when the total strength of the input signals exceeds a certain threshold. While complex cognitive activities involve large neural networks, at the most fundamental level, whether it is a single neuron or a single perceptron, the primary task is classification.
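To make the "fire only above a threshold" idea concrete, here is a minimal sketch of a single perceptron in Python. The weights and threshold are hypothetical values chosen for illustration so that the unit behaves like a logical AND; they are not learned from data.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 (fire) if the weighted sum of inputs exceeds the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0


# Hypothetical parameters (an assumption for illustration): with equal weights
# of 1.0 and a threshold of 1.5, the unit fires only when both inputs are on.
AND_WEIGHTS = (1.0, 1.0)
AND_THRESHOLD = 1.5

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, perceptron((a, b), AND_WEIGHTS, AND_THRESHOLD))
```

Even this tiny unit is a binary classifier: it splits the four possible inputs into "fire" and "do not fire".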
Classification is indeed how we learn about the world. As babies, we learn to identify a cat and a toy, or what is edible versus inedible. As we grow older, we learn to categorize words into nouns and verbs and recognize facial and vocal expressions that signify emotions like happiness or anger. We make sense of the world by organizing and classifying information, which is essential for understanding, decision-making, and communication.
Humans are amazingly good at classification. It allows us to identify risks in a split second and avoid accidents; it even enables us to sense our surroundings without direct attention. However, this ability is so deeply ingrained in our minds that it can lead to unconscious biases, such as singling out people who look different from us, which keeps us from embracing diversity.
Returning to chemistry, recognizing that our learning and cognitive processes begin with classification makes the idea that “organic chemistry is nothing but pattern recognition” seem less outrageous. However, going from classification to pattern recognition involves additional tasks such as feature identification/extraction and regression. These tasks may include evaluating the strength of electrophilicity, the crowdedness of the reaction site, the temperature, and the polarity of the environment. Admittedly, these subtleties may not be fully grasped even by the most seasoned chemists, but they can be effectively learned by machines given suitable data sets, as evidenced by the growing number of publications on the application of artificial intelligence in synthetic organic chemistry.
To conclude, let me quote Flynn and Ogilvie from their article in the Journal of Chemical Education: “Chemical reactions follow patterns, and these patterns can allow a chemist to predict how a chemical will behave, even if they have never seen a particular reaction before.” Since we are not ditching learning/teaching/using the electron-pushing formalism anytime soon, we may improve our approach to thinking about and practicing organic chemistry by comparing it with the methods that fundamentally make machines “smarter” in synthesis.
Decoding “struggle”—my observations
A word that I find very frequently in students’ emails is struggle. Instead of employing alternative expressions like be stuck, have a hard time, have trouble, or have difficulty, they consistently lean towards using struggle. This linguistic difference struck me as a cultural contrast after I relocated to the UK, as I tend to associate struggle with more serious connotations. This observation prompted me to reflect, and I realized it might largely be a regional word preference. A cursory exploration of the verb forms of struggle in the Corpus of Global Web-Based English (GloWbE) confirms not only a higher frequency of use but also a markedly greater overall occurrence of this word in GB compared with any other country in the database (Figure 1).
Yet, delving into the historical trajectory of this term reveals a notable surge in popularity around the year 2000 in both British and American books, according to the Google Books Ngram Viewer (Figure 2a). The nearly identical trends observed here may suggest a universal phenomenon across most English-speaking countries. Intriguingly, examining the noun and verb forms of struggle shows distinct patterns (Figure 2b). The noun usage gained traction as early as 1840, consistently surpassing that of the verb form for over a century. Curiously, around 2000, the verb form began to be employed frequently to signify “making strenuous or violent efforts in the face of difficulties or opposition” or “proceeding with difficulty or great effort,” according to Merriam-Webster. The rapid growth in the usage of the verb form is noteworthy, surpassing the noun form in frequency after 2010. In fact, there appears to be a recent decreasing trend in the use of struggle as a noun. Since the Google Books Ngram Viewer only provides data up to 2019, a comparison using other text corpora, such as NOW (News on the Web, 2010–present) and COCA (Corpus of Contemporary American English, 1990–2019), has been included to depict the evolving trends in usage. I should note that the scale of the year axis is not linear, which allows for a clearer representation of trends in recent years.
The question arises: What caused the rapidly increased use of struggle as a verb? At this stage, I do not want to make any suggestions but simply want to share this observation. How would you interpret this data? Does it suggest that life has become more difficult than it was 20 years ago? While I hesitate to draw such conclusions, I sincerely hope that the current upward trend in the use of struggle will subside in the near future.
Exploring thionation with Hückel molecular orbital theory.
Hückel Molecular Orbital (HMO) theory stands as one of my favourite concepts from my undergraduate studies in physical organic chemistry. Despite its simplicity and the underlying assumptions, it offers an intuitive understanding of π-conjugated molecules and is a readily comprehensible extension of the somewhat simplistic particle-in-a-box model.
Recently, while pondering the question of why thionating a carbonyl molecule enhances its electron-accepting capabilities, as explored in our 2023 publication (PCCP DOI: 10.1039/D2CP05186A), I realized that the good old HMO theory could offer a fairly satisfying explanation. This question arose from our observations while working with various thiocarbonyl molecules (R2C=S). We (as well as others) consistently observed that these molecules exhibit a greater propensity to accept electrons than their carbonyl counterparts (R2C=O). This observation appeared to defy expectations based on a somewhat simplistic consideration of electronegativity (2.58 for sulfur and 3.44 for oxygen).
Upon closer analysis, we found that the enhanced electron affinity upon C=O → C=S substitution stems from the weaker overlap between sulfur’s 3p orbitals and carbon’s 2p orbitals in C=S, in comparison to the better interaction between the valence 2p orbitals of carbon and oxygen in C=O. This reduced overlap results in a less effective antibonding interaction, leading to a lower LUMO energy and increased electron affinity.
The diminished overlap between (2p)C and (3p)S is well-represented by the (empirical) parameters utilized in the HMO theory. In the HMO Hamiltonian matrix, the Coulomb integral αC represents the energy of the individual (2pπ)C atomic orbital of carbon, while the resonance integral βCC characterizes the coupling between adjacent (2pπ)C orbitals. By adjusting the integral values for carbon through αX = αC + kX × βCC and βXY = kXY × βCC, the property of the frontier orbitals of heteroatom-containing π systems can be estimated in the Hückel framework (kX and kXY are proportionality constants for heteroatom X).
Using formaldehyde (H2CO) and thioformaldehyde (H2CS) as examples, since
αO = αC + kO × βCC = αC + 0.97βCC
βC=O = kC=O × βCC = 1.06βCC
and
αS = αC + kS × βCC = αC + 0.46βCC
βC=S = kC=S × βCC = 0.81βCC
the energy of the π* level would be E(π*) = αC – 0.68βCC for H2CO and E(π*) = αC – 0.61βCC for H2CS, i.e. the π* level of H2CS is less destabilized. In fact, even without solving the eigenvalue problem of HMO, the lower π* level of thiocarbonyl can be intuitively expected since the resonance integral βC=S is smaller than βC=O. In our PCCP paper, we further showed that this βC=X could be quantitatively estimated as |⟨hC|F|hS⟩|, where hX are the natural atomic hybrid orbitals (NHOs) and F is the Fock (Kohn–Sham) operator.
However, it is important to note that semiempirical HMO may not always yield the correct picture. When comparing ester (RC(O)OR) and thioester (RC(O)SR), we found that E(π*) = αC – 0.79βCC for the former and αC – 0.86βCC for the latter, suggesting a higher π* level for the sulfur compound. This result, however, conflicts with DFT calculations and the experimental reduction potentials of these molecules. The inconsistency arises because βC–S = 0.69βCC is larger than βC–O = 0.66βCC (as opposed to being smaller in the previous comparison; note the difference between the βC=X and βC–X used here). Therefore, despite the elegance of HMO, we recommend using the DFT-based NBO/NHO analysis to obtain a more reliable understanding of the electronic effects of heteroatom substitution on π-conjugated molecules. Please see DOI: 10.1039/D2CP05186A for detailed discussion.
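The two π* values quoted above can be reproduced with a few lines of Python. This is only a sketch, not the script from the paper: it solves the 2 × 2 Hückel problem for a C=X unit in closed form, working in units where E = αC + x·βCC, so the matrix to diagonalize is [[0, kC=X], [kC=X, kX]]. The function name `pi_levels` is my own choice for illustration.

```python
import math


def pi_levels(k_x, k_cx):
    """Eigenvalues x of [[0, k_cx], [k_cx, k_x]], so that E = alpha_C + x * beta_CC.

    Because beta_CC is negative, the larger x is the bonding (pi) level
    and the smaller x is the antibonding (pi*) level.
    """
    disc = math.sqrt(k_x ** 2 + 4.0 * k_cx ** 2)
    x_pi = (k_x + disc) / 2.0
    x_pi_star = (k_x - disc) / 2.0
    return x_pi, x_pi_star


if __name__ == "__main__":
    # Parameters taken from the text: (k_X, k_C=X) for oxygen and sulfur.
    for name, k_x, k_cx in (("H2CO", 0.97, 1.06), ("H2CS", 0.46, 0.81)):
        _, x_star = pi_levels(k_x, k_cx)
        print(f"{name}: E(pi*) = alpha_C {x_star:+.2f} beta_CC")
```

The closed form makes the qualitative argument visible: a smaller resonance integral kC=X shrinks the square-root term, which pulls the π* level closer to αC.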
For this work, we also prepared a (yet another!? link1 and link 2) Python script to solve simple Hückel systems (not the ‘extended Hückel’). While this script is primarily designed for π-conjugated hydrocarbons, you can adapt it for heteroatom-containing systems by adjusting the Coulomb and resonance integrals in the Hückel matrix. Several examples are included in this interactive Jupyter Notebook for your reference (hosted in Google Colab).
Furthermore, to enhance usability, the script can generate the Hückel matrix from a molecular SMILES string using the RDKit toolkit’s GetAdjacencyMatrix function.
To end this post: you may find it intriguing to explore this script and compare the energy levels of two isomeric molecules, 2-phenyl-1,3-butadiene (C=CC(c1ccccc1)=C) and 1,4-divinylbenzene (C=Cc1ccc(cc1)C=C). Spoiler alert: they are the same!
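If you want to check the spoiler without installing anything, the following standard-library sketch compares the characteristic polynomials of the two Hückel (adjacency) matrices exactly, using integer arithmetic (the Faddeev–LeVerrier recursion). It is not the script mentioned above: the bond lists are written out by hand following the atom order of the two SMILES strings instead of being generated by RDKit, and all atoms are treated as carbon (H = αC·I + βCC·A). Identical polynomials mean identical sets of Hückel energy levels.

```python
def adjacency(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of bonded atom pairs."""
    A = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        A[i][j] = A[j][i] = 1
    return A


def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def char_poly(A):
    """Coefficients c[0..n] of det(x*I - A) via the Faddeev-LeVerrier recursion."""
    n = len(A)
    c = [0] * (n + 1)
    c[n] = 1
    M = [[0] * n for _ in range(n)]
    for k in range(1, n + 1):
        AM = matmul(A, M)
        # M_k = A * M_(k-1) + c[n-k+1] * I
        M = [[AM[i][j] + (c[n - k + 1] if i == j else 0) for j in range(n)]
             for i in range(n)]
        AMk = matmul(A, M)
        trace = sum(AMk[i][i] for i in range(n))
        c[n - k] = -trace // k  # the division is exact for integer matrices
    return c


# Atom indices follow the SMILES order (hand-derived, an assumption of this sketch).
# 2-phenyl-1,3-butadiene: C=CC(c1ccccc1)=C
PHENYLBUTADIENE = adjacency(10, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5),
                                 (5, 6), (6, 7), (7, 8), (8, 3), (2, 9)])
# 1,4-divinylbenzene: C=Cc1ccc(cc1)C=C
DIVINYLBENZENE = adjacency(10, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5),
                                (5, 6), (6, 7), (7, 2), (5, 8), (8, 9)])

if __name__ == "__main__":
    print(char_poly(PHENYLBUTADIENE) == char_poly(DIVINYLBENZENE))
```

Comparing polynomials with exact integers avoids any floating-point tolerance question; a numerical eigenvalue solver would give the energy levels themselves.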
The Bürgi–Dunitz angle revisited.
I came across the blog post by Prof. Rzepa on the Bürgi–Dunitz angle the other day. This is a topic close to my heart. Embarrassingly, I did not know this concept until midway through my PhD studies at ETH, where I actually had regular interactions with Prof. Dunitz. The discovery of the Bürgi–Dunitz angle marked one of the key moments in crystallography, organic and supramolecular chemistry, and molecular orbital analysis. I was very impressed by the beauty and the neat idea behind this study, and have since used it as an example to show students the vast information one can get from structural analysis. It was truly insightful that Bürgi and Dunitz were able to distill the preferred reaction trajectory starting from just 6 (!) crystal structures.
Nearly 50 years later, now that we have more than 1 million structures in the CCDC database, we should be better equipped to perform the structural analysis needed to re-validate the theory behind the Bürgi–Dunitz angle. However, if you look at the angle of any nucleophilic atom (N, O, S, etc.) approaching carbonyl functionalities, the most commonly found angles are actually about 90° rather than near 105°, as clearly demonstrated by Prof. Rzepa in the blog post. So, were Bürgi and Dunitz wrong?
After a close examination of the structure hits, I found that the primary source of the discrepancy is the overwhelming number of structures featuring antiparallel C=O interactions. In those cases, the O of one moiety sits above the C of the other, resulting in an O…C=O angle of ~90°. This is quite interesting; starting from the analysis of nucleophilic addition to a carbonyl group, we ended up with one of the most prevalent non-covalent interactions in proteins.
I’ve added my comment below Prof. Rzepa’s post. Part of the reason that I want to write about it again here is to point out that, in the age of machine learning, we can easily obtain a huge amount of data. Yet, it is extremely important to keep an eye on the actual information in those data. After all, as ML people often say: your model will be only as good (or as bad) as the data you have.
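For readers who want to measure this angle in their own structures, here is a minimal sketch of how the Nu…C=O angle can be computed from Cartesian coordinates. The coordinates in the example are made up for illustration (they do not come from any crystal structure), and the function name `attack_angle` is my own.

```python
import math


def attack_angle(nu, c, o):
    """Return the Nu...C=O angle in degrees, measured at the carbonyl carbon."""
    v_nu = [a - b for a, b in zip(nu, c)]  # vector C -> Nu
    v_o = [a - b for a, b in zip(o, c)]    # vector C -> O
    dot = sum(a * b for a, b in zip(v_nu, v_o))
    norm = math.sqrt(sum(a * a for a in v_nu)) * math.sqrt(sum(a * a for a in v_o))
    cos_theta = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


if __name__ == "__main__":
    carbon = (0.0, 0.0, 0.0)
    oxygen = (0.0, 0.0, 1.2)
    # Hypothetical contact sitting directly above the carbon, perpendicular to
    # C=O, as in the antiparallel C=O arrangement described in the text.
    print(round(attack_angle((3.0, 0.0, 0.0), carbon, oxygen), 1))
```

Applied to a set of database hits, a histogram of this angle is what separates the ~90° antiparallel contacts from the ~105° approach trajectory.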
Goodbye Northwestern, it has been a wonderful time.
After a long period of job searching, I am delighted to take up a post at Cardiff University starting in January 2019. I am deeply indebted to all my mentors, colleagues, friends, and family for their help and support during this process. Working on photo energy research for the past six and a half years at Northwestern has been a wonderful experience. It is a true privilege to collaborate with the brightest minds (you know I am talking about you!) on a daily basis and to have access to cutting-edge technologies. My views on science and its interplay with education and society have grown and matured; they are the best gift that I will bring with me across the Atlantic Ocean and pass on to my future coworkers.
Judging at Intel ISEF 2017
“Science’s rightful place is in service of society” (D. Sarewitz, 2013) has always been a big part of my beliefs. This summer, I was very fortunate to participate as a Chemistry Grand Awards judge at the International Science and Engineering Fair (ISEF), the biggest science fair in the world.
Although the science fair was a huge thing in my high school, I didn’t do as well as many of my classmates, and ISEF 2017 was the first time I had seen such a high-level competition. I was very impressed by one high schooler’s perseverance in identifying an undocumented ferric sulfate compound from the reaction of sulfuric acid and gold ore, which he obtained while hiking; by the applicability of the algorithm that another student developed to filter and assign signals in high-dimensional protein NMR spectroscopy to accelerate drug discovery (and by his smartness, too); by the usefulness of silk fibers as moisture-activated torsional actuators, discovered by yet another student; and by many other projects.
The judges’ caucus was another special experience. The panel was composed of industrial scientists, university professors, researchers, postdocs, and PhD students. Some had participated in more science fairs than others; we discussed all(!) the projects and tried to persuade(!) our colleagues why one project was better or worse than another. The voting/discussion cycle repeated again and again until all the prizes were decided. (Awarded students, you should really thank the eloquent and passionate judges who lobbied for your project!)
Judging ISEF was overall a great experience, especially seeing and feeling all the students’ pure enthusiasm for science, and I am very glad I could contribute and help. Thanks to my grad school friend Grace for the invitation!
Ethylhexyl in real life.
As a “purist”, I have never really liked seeing a 2-ethylhexyl substituent in my molecules, as it usually has an undefined stereogenic center at the 2-position. Materials incorporating such a functionality are thus random mixtures(!) of (R)- and (S)-stereoisomers, not to mention molecules possessing multiple 2-ethylhexyl substituents.
Materials scientists, especially aromatic polymer chemists, use 2-ethylhexyl to enhance solubility, as such a bulky substituent disrupts pi-stacking/aggregation. Outside of research labs, as it turns out, molecules with 2-ethylhexyl are actually ubiquitous in our daily life; I wonder if those 2-ethylhexyls were also introduced to modulate the aggregation properties.
Just to name a few: 2-ethylhexyl nitrate (2-EHN) is a cetane improver added to diesel fuels; bis(2-ethylhexyl) phthalate, which is produced at approx. 3 billion kg/year, is a plasticizer for PVC; and octocrylene and octyl methoxycinnamate are ingredients in sunscreen products that absorb UVB and UVA. Without a doubt, I should have given ethylhexyl much more credit!
We know so little, so little indeed, about solubility.
PDI is a notoriously insoluble dye, and it is well known that linear aliphatic N-substituents won’t make it much more soluble. However, the Grozema group published a mind-blowing work back in 2014 (DOI: 10.1039/c4cc00330f) that completely overturned this common belief (along with other important findings, of course). See the picture! I have known about this work for quite a while (thanks, Pat, for showing me this paper), and I still cannot make sense of it. This just shows how little we know about intermolecular interactions; there is more to learn!
Alkyl groups modulate π-stacking interactions: size is not the only thing that matters.
The group of Ken Shimizu at the University of South Carolina reported a very interesting finding: the strength of repulsive and/or attractive interactions between π-stacked aromatics can be non-trivially influenced by the alkyl substituents. Should the interacting area (surface contact area) between the aromatics be large enough, even those bearing tBu substituents can display stronger attraction than those bearing Me ones!
Does methanol dissolve silica? Maybe not, suggests Biotage.
This is an age-old question for people using MeOH in their flash chromatography. I guess the answer/result might have something to do with the pore size of the frit of your column. Anyhow, have a look at an interesting analysis conducted by Biotage.