In 2.1, a Cohort shifted from being a TreeSet of Integers to a TreeSet of new CohortMembership objects.
This adds performance overhead to basic Cohort operations. For instance, adding a set of 100,000 patient ids to a Cohort jumps from taking milliseconds to taking a second or two. This may seem insignificant, but as Cohorts are used extensively for reporting and may be manipulated dozens of times in a single report, this can add up.
We should find a way to maintain the performance of a Cohort for the majority use case that does not require start date and end date.
I tested switching from storing as a TreeSet to a HashSet... this gave a marginal improvement in performance, certainly not a game-changer.
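For anyone who wants to reproduce this kind of comparison, here is a rough timing sketch. It is not the test I ran: `MemberStub` and `CohortAddTiming` are hypothetical names, and `MemberStub` is a much lighter stand-in for the real CohortMembership (it only holds a patient id), so the numbers it prints are only indicative. Swapping in the real class, or using a proper harness such as JMH, would give more trustworthy results.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.TreeSet;
import java.util.function.IntFunction;

/** Hypothetical stand-in for CohortMembership; the real class carries more state. */
final class MemberStub implements Comparable<MemberStub> {
    final int patientId;

    MemberStub(int patientId) {
        this.patientId = patientId;
    }

    @Override
    public int compareTo(MemberStub other) {
        return Integer.compare(patientId, other.patientId);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof MemberStub && ((MemberStub) o).patientId == patientId;
    }

    @Override
    public int hashCode() {
        return Integer.hashCode(patientId);
    }
}

/** Naive wall-clock timing of bulk adds; no JIT warm-up, so treat results as rough. */
final class CohortAddTiming {

    /** Adds n elements built by factory to target and returns elapsed nanoseconds. */
    static <T> long timeAdds(Collection<T> target, IntFunction<T> factory, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            target.add(factory.apply(i));
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 100_000;
        long treeInts = timeAdds(new TreeSet<Integer>(), i -> i, n);
        long treeStubs = timeAdds(new TreeSet<MemberStub>(), MemberStub::new, n);
        long hashStubs = timeAdds(new HashSet<MemberStub>(), MemberStub::new, n);
        System.out.printf("TreeSet<Integer>:    %d ms%n", treeInts / 1_000_000);
        System.out.printf("TreeSet<MemberStub>: %d ms%n", treeStubs / 1_000_000);
        System.out.printf("HashSet<MemberStub>: %d ms%n", hashStubs / 1_000_000);
    }
}
```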
Some potential solutions include:
- Reverting Cohort back to its original design and creating a new CohortWithDateRange object... this would introduce some backwards incompatibility
- Changing the implementation of Cohort so that it supports either a Set<Integer> or a Set<CohortMembership> as its underlying storage, and only starts to use CohortMembership if the consumer specifies a start and/or end date when adding a Patient. This might work, but seems hacky and error-prone... we'd need to tightly restrict consumers from accessing the underlying model.
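To make the second option more concrete, here is a minimal sketch of what such a dual-storage class might look like. Everything in it is an assumption for illustration: `HybridCohortSketch` and `MembershipSketch` are hypothetical names, not existing OpenMRS classes, and the membership side is keyed by patient id in a TreeMap rather than held in a plain Set<CohortMembership>, purely to keep lookups simple.

```java
import java.util.Collections;
import java.util.Date;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical stand-in for CohortMembership. */
final class MembershipSketch {
    final int patientId;
    final Date startDate; // null means "no start date"
    final Date endDate;   // null means "no end date"

    MembershipSketch(int patientId, Date startDate, Date endDate) {
        this.patientId = patientId;
        this.startDate = startDate;
        this.endDate = endDate;
    }
}

/** Hypothetical cohort that stays on plain ids until a date is first supplied. */
final class HybridCohortSketch {
    // Fast path: plain patient ids. Set to null once we upgrade.
    private TreeSet<Integer> ids = new TreeSet<>();
    // Slow path: stays null until a caller supplies a start or end date.
    private TreeMap<Integer, MembershipSketch> memberships;

    void addPatientId(int patientId) {
        addMembership(patientId, null, null);
    }

    void addMembership(int patientId, Date startDate, Date endDate) {
        boolean dateless = startDate == null && endDate == null;
        if (memberships == null) {
            if (dateless) {
                ids.add(patientId); // still on the fast path
                return;
            }
            upgradeToMemberships();
        }
        MembershipSketch m = new MembershipSketch(patientId, startDate, endDate);
        if (dateless) {
            memberships.putIfAbsent(patientId, m); // do not clobber existing dates
        } else {
            memberships.put(patientId, m);
        }
    }

    /** One-time migration of the plain ids into dateless membership objects. */
    private void upgradeToMemberships() {
        memberships = new TreeMap<>();
        for (Integer id : ids) {
            memberships.put(id, new MembershipSketch(id, null, null));
        }
        ids = null;
    }

    boolean contains(int patientId) {
        return memberships == null ? ids.contains(patientId) : memberships.containsKey(patientId);
    }

    int size() {
        return memberships == null ? ids.size() : memberships.size();
    }

    boolean usesMemberships() {
        return memberships != null;
    }

    /** Read-only view, so consumers cannot touch the underlying storage. */
    Set<Integer> getPatientIds() {
        return Collections.unmodifiableSet(memberships == null ? ids : memberships.keySet());
    }
}
```

Note that every accessor has to branch on which storage is active, and nothing mutable is handed out... that is exactly the restriction on consumers mentioned above, and it is where the error-prone part would live.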