How Common Are Different Functional Groups?
Since the ostensible purpose of organic methodology is to develop reactions that are useful in the real world, the utility of a method is in large part dictated by the accessibility of the starting materials. If a compound is difficult to synthesize or hazardous to work with, then it’s difficult to convince people to use it in a reaction (e.g. most diazoalkanes). Organic chemists are pragmatic, and would usually prefer to run a reaction that starts from a commercial and bench-stable starting material.
For instance, this explains the immense popularity of the Suzuki reaction: although the Negishi reaction (using organozinc nucleophiles) usually works better for the same substrates, you can buy lots of the organoboron nucleophiles needed to run a Suzuki and leave them lying around without taking any precautions. In contrast, organozinc compounds usually have to be made from the corresponding organolithium/Grignard reagent and used fresh, which is considerably more annoying.
The ideal starting material, then, is one which is commercially available and cheap. In recent years, it’s become popular to advertise new synthetic methods by showing that they work on exceptionally cheap and common functional groups, and in particular to compare the abundance of different functional groups to demonstrate that one starting material is more common than another. To pick just one of many examples, Dave MacMillan used this plot to show why cross-coupling reactions of alcohols were important (ref):
When I saw MacMillan’s talk at MIT last year, I was curious what it would take to make additional graphics like this. The “number of reactions” plot can be made pretty easily from Reaxys, but I’ve always been uncertain how the “number of commercial sources” plots are made: I haven’t seen references listed for these numbers, nor is anything usually found in the Supporting Information.
I decided to take a swing at getting this data myself by analyzing the Mcule "building blocks" database, which contains about 3.5 million compounds. Although Mcule doesn't define what a building block is (at least, not that I can find), it’s likely that their definition is similar to that of ZINC, which defines building blocks as “those catalogs of compounds available in preparative quantities, typically 250 mg or more” (ref). This seems like a reasonable proxy for the sorts of compounds synthetic chemists might use in reactions. I defined patterns to match a bunch of functional groups using SMARTS/SMILES, and then used RDKit to find matches in the Mcule building blocks database. The code can be found on GitHub, along with the patterns I used.
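For readers who haven't used RDKit before, the matching step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the repo: the SMARTS patterns, the `count_functional_groups` helper, and the five example molecules are all assumptions made up for this sketch, and the real patterns are almost certainly more careful.

```python
from rdkit import Chem

# Hypothetical SMARTS patterns, for illustration only.
# The patterns actually used for the analysis live in the GitHub repo.
PATTERNS = {
    "alcohol": "[CX4][OX2H]",
    "aryl bromide": "c[Br]",
    "aryl chloride": "c[Cl]",
    "carboxylic acid": "[CX3](=O)[OX2H1]",
    "trifluoromethyl": "[CX4](F)(F)F",
}


def count_functional_groups(smiles_list, patterns=PATTERNS):
    """Count how many molecules contain each functional group at least once."""
    # Compile each SMARTS string once rather than once per molecule.
    compiled = {name: Chem.MolFromSmarts(s) for name, s in patterns.items()}
    counts = {name: 0 for name in patterns}
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            # RDKit returns None for SMILES it cannot parse; skip those.
            continue
        for name, patt in compiled.items():
            if mol.HasSubstructMatch(patt):
                counts[name] += 1
    return counts


# Tiny stand-in for the building-block database (made-up examples).
example_smiles = [
    "CCO",                # ethanol
    "Brc1ccccc1",         # bromobenzene
    "Clc1ccccc1C(=O)O",   # 2-chlorobenzoic acid
    "FC(F)(F)c1ccccc1",   # trifluorotoluene
    "CCOCC",              # diethyl ether (matches none of the patterns)
]
counts = count_functional_groups(example_smiles)
for name, n in counts.items():
    print(f"{name}: {n}")
```

Note that this counts molecules, not occurrences: a compound with three hydroxyl groups still counts once toward "alcohol". Whether that is the right choice depends on the question being asked.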
The results are shown below. As expected, ethers, amines, amides, and alcohols are quite common. Surprisingly, aryl chlorides aren't that much more common than aryl bromides—and, except for aliphatic fluorides, all aliphatic halides are quite rare. Allenes, carbodiimides, and SF5 groups are virtually unheard of (<100 examples).
(this isn’t a table, it’s an image, because Substack doesn’t do tables evidently—see here for copyable numbers)
(Fair warning: I’ve spot-checked a number of the generated SMILES files (also on GitHub), but I haven’t looked through every molecule, so it’s possible that there are some faulty matches. I wouldn’t consider these publication-quality numbers yet.)
An obvious caveat: there are lots of commercially “rare” functional groups which are easily accessible from more abundant functional groups. For instance, acid chlorides seem uncommon in the above table, but can usually be made from ubiquitous carboxylic acids with e.g. SOCl2. So these data shouldn’t be taken as a proxy for a more holistic measure of synthetic accessibility—they measure commercial availability, that’s all.
What conclusions can we draw from this?
The most common functional groups are the milquetoast ones: alcohols, amines, esters, etc. Perhaps this explains where all the new reactions have gone: unless your new method works on alcohols or amines, it will struggle to get traction in most of chemical space relative to e.g. Williamson ether synthesis or reductive amination. (Kudos to MacMillan for identifying this; vide supra.)
Ureas are much more common than you’d expect from academic methods papers. This I think speaks to the difference between what methodologists want and what medicinal chemists want. Ureas are a bit annoying to work with: they’re pretty polar by the standards of academia, they’re not always soluble in organic solvents, and they have a tendency to stick to transition metal catalysts or get deprotonated by strong bases. But they’re easy to make in libraries, since the isocyanate/amine disconnection is so robust, and they’re excellent hydrogen-bond donors and acceptors. CORRECTION: There's a SMARTS error, so the match for "ureas" actually matches amides—disregard this section. Thanks to @wmdhn for catching this.
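To show how this kind of SMARTS mistake can happen, here is a small sketch. The "loose" pattern below is an assumption chosen for illustration, not the faulty pattern from the repo (which isn't reproduced here). It only asks for a carbonyl carbon bonded to one nitrogen, so it matches any amide; the "strict" pattern requires nitrogen on both sides of the carbonyl.

```python
from rdkit import Chem

# Hypothetical patterns, for illustration only (not the ones from the repo).
loose_urea = Chem.MolFromSmarts("[CX3](=[OX1])[NX3]")         # C(=O)N: also matches amides
strict_urea = Chem.MolFromSmarts("[NX3][CX3](=[OX1])[NX3]")   # N-C(=O)-N only

acetamide = Chem.MolFromSmiles("CC(N)=O")       # an amide, not a urea
dimethylurea = Chem.MolFromSmiles("CNC(=O)NC")  # a genuine urea

for label, mol in [("acetamide", acetamide), ("N,N'-dimethylurea", dimethylurea)]:
    print(
        label,
        "| loose:", mol.HasSubstructMatch(loose_urea),
        "| strict:", mol.HasSubstructMatch(strict_urea),
    )
```

Running a pattern against a handful of known positives and known negatives like this is a cheap way to catch over-matching before scanning millions of compounds.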
Uncommon functional groups, like SF5 and allenes, are very uncommon. If you want to introduce an SF5 group, you are in for a rough time: there aren’t great ways to add it to molecules (although there have been some steps forward in recent years), and there are only 18 commercial examples. So people can write as many papers as they want about how cool SF5 groups are: I still doubt we’ll see them used very much in the near future.
But also, the abundance of a given functional group is very elastic in the long run. Trifluoromethyl groups used to be extremely rare—they’re not found in nature!—but now 1 in 8 molecules has a CF3 group. CF3 just turns out to be a very good handle for a lot of molecular design tasks, and so people found ways to introduce it all over the place, and now it’s not hard to get molecules that have trifluoromethyl groups. Synthetic chemists should feel good about this.
The functional-group-specific SMILES files are in the previously mentioned GitHub repo, so anyone who wants to e.g. look through all the commercially available alkenes and perform further cheminformatics analyses can do so. I hope the attached code and data help other chemists perform similar, and better, studies, and that this sort of thinking can be useful for those who are currently engaged in reaction discovery.
Thanks to Eric Jacobsen for helpful conversations about these data.