Bottlenecks are hard to gauge without usage patterns and exact hardware specs, but if we assume all of the machines in the cluster have reliable network connectivity, quality network interfaces, quality disks and plenty of CPU/RAM, then potential bottlenecks are as follows.
Possible bottlenecks by service:
Bookshelf (disk I/O)
PostgreSQL (memory, CPU, disk I/O)
Solr (memory, CPU, disk I/O)
Reporting (significantly increases PostgreSQL utilization)
Manage (CPU, memory)
Possible bottlenecks by topology:
FE (memory, CPU)
BE (memory, CPU, disk I/O)
Specific possible bottlenecks by usage pattern:
During bootstrapping you'll see increased CPU usage on the FEs during client key creation. You'll also see increased Bookshelf disk I/O when the initial cookbooks are downloaded, though on subsequent runs the client uses its local on-disk cache when possible. We've implemented cookbook caching on the FEs, a feature that should ship in the next Chef Server release and should help reduce I/O to the BE. The client also now has the ability to create its own client key, however most people aren't using it yet, and there are theoretical concerns about key generation on new machines without much entropy.
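As a sketch of the local key option mentioned above (the server URL and validator name are hypothetical; check the option names against your Chef Client version), key generation can be moved off the FEs via `client.rb`:

```ruby
# /etc/chef/client.rb -- sketch, assuming a recent Chef Client.
# Generate the client's key pair locally instead of asking the server to,
# shifting the CPU cost of key creation off the FEs.
local_key_generation true

# Standard identity settings, shown for context (hypothetical values).
chef_server_url        'https://chef.example.com/organizations/myorg'
validation_client_name 'myorg-validator'
```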
When searching you're going to see an increase in CPU and memory on the BEs as Solr scans its index. Unpacking the gzipped node blobs also adds disk I/O, memory, and CPU usage.
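Each call like the recipe sketch below (standard Chef search DSL; the role and environment names are hypothetical) triggers a Solr index scan and then the inflation of every matching node blob on the BE:

```ruby
# Recipe sketch -- every search() call hits Solr on the BE, and each
# matching node object must be unpacked from its gzipped blob.
web_nodes = search(:node, 'role:web AND chef_environment:production')

web_nodes.each do |n|
  Chef::Log.info("found web node: #{n['fqdn']}")
end
```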
chef-client runs (erchef, PostgreSQL, Solr)
Client runs are difficult to profile due to the variance in usage patterns (search, data bags), but in general cookbook dependency graph solving increases CPU on the FE, while cookbook downloads (memory and disk I/O) and node saves (CPU and memory) increase load on the BE.
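The depsolving cost on the FE scales with the constraint graph your cookbooks declare. For illustration (cookbook names and version pins here are hypothetical), each `depends` line in a cookbook's `metadata.rb` adds edges erchef must solve on every client run:

```ruby
# metadata.rb sketch -- each constraint below becomes part of the
# dependency graph solved on the FE for every chef-client run.
name    'myapp'
version '1.2.0'

depends 'nginx',      '~> 2.7'   # hypothetical pins
depends 'postgresql', '>= 3.0'
depends 'logrotate'
```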
Manage UI Sessions (erchef, PostgreSQL, Solr)
The Manage UI uses search and erchef endpoints to paint several of the dropdowns, which usually increases load on Solr and PostgreSQL significantly.
We tend to throw upwards of 50% of available memory and CPU at PostgreSQL and split the remainder among the other services on the BE. Once we've gathered data on your usage patterns, we should be able to tune the services appropriately.
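A sketch of what that split might look like in `/etc/opscode/chef-server.rb` on a BE with, say, 32 GB of RAM. The attribute names follow common Chef Server tunables, but this is an illustration, not a drop-in config; verify the names and defaults against your release's documentation:

```ruby
# /etc/opscode/chef-server.rb -- tuning sketch for an assumed 32 GB BE,
# NOT a tested configuration. Check attribute names against your version.

# Steer roughly half of RAM toward PostgreSQL: shared_buffers plus the
# OS page cache accounted for via effective_cache_size.
postgresql['shared_buffers']       = '8GB'
postgresql['effective_cache_size'] = '16GB'
postgresql['work_mem']             = '16MB'

# A slice of the remainder for the Solr JVM heap (value in MB).
opscode_solr4['heap_size'] = 4096
```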
Although it's not a recommended configuration, we have customers running clusters (4 FEs, 1 BE HA pair) with 150k+ nodes. Slow spinning disks on the BE are generally the most common bottleneck once a cluster gets that busy, so SSD storage is essentially a necessity.
We don't support scaling out the BE services independently, as the complexity of maintaining consistency across backup/HA/upgrades is significant. We may support that in the future, but for now we recommend smaller clusters, each with a single BE HA pair.