Design Rules for Data Mining Algorithms: A Tryst with Modularity
I took a few Machine Learning courses, and most of them introduced ready-made toolboxes in languages like R and Python. These toolboxes work well as long as we do not need to dig into the details of the algorithm. Most software packages present a Machine Learning task as a monolithic subroutine, which makes it hard, if not impossible, to customize its operation beyond changing a parameter passed to the subroutine. In other words, these packages are not modular.
Modular software architectures are evolvable and easy to analyse and upgrade. In the book Design Rules [Baldwin]_, the authors outline the benefits of modularity with the following six operators for modular designs:
- Splitting
- Substitution
- Augmenting
- Excluding
- Inversion
- Porting
Splitting and Substitution are complementary operators: Splitting breaks interconnected tasks into independent tasks that can evolve separately, and Substitution lets any one of them be replaced by a better-performing module. Similarly, Augmenting and Excluding are complementary operators that enhance an existing design by adding a helper module or by taking away a redundant one. Inversion and Porting allow common functionality to be moved into modules that are reusable across the rest of the design. These are good rules of thumb for designing modular systems, and some frameworks, like Guice, make it easier than others to apply them to software design.
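
To make the first four operators concrete, here is a minimal sketch in Python. It is only an illustration and is not taken from Design Rules: the step functions (`drop_negatives`, `clip_negatives`, `scale_to_unit`, `round_two`) and the `run` helper are hypothetical names, and a plain dictionary of named steps stands in for a modular design.

```python
from typing import Callable, Dict, List

Step = Callable[[List[float]], List[float]]


def drop_negatives(xs: List[float]) -> List[float]:
    return [x for x in xs if x >= 0]


def clip_negatives(xs: List[float]) -> List[float]:
    return [max(x, 0.0) for x in xs]


def scale_to_unit(xs: List[float]) -> List[float]:
    top = max(xs, default=0.0)
    return [x / top for x in xs] if top else list(xs)


def round_two(xs: List[float]) -> List[float]:
    return [round(x, 2) for x in xs]


def run(steps: Dict[str, Step], xs: List[float]) -> List[float]:
    """Apply the named steps in insertion order."""
    for step in steps.values():
        xs = step(xs)
    return xs


# Splitting: one clean-up routine expressed as independent, named steps.
pipeline = {"filter": drop_negatives, "scale": scale_to_unit}

# Substitution: replace one step without touching the others.
substituted = {**pipeline, "filter": clip_negatives}

# Augmenting: add a helper step to the existing design.
augmented = {**pipeline, "round": round_two}

# Excluding: take a step away from the design.
excluded = {name: step for name, step in pipeline.items() if name != "scale"}

data = [-1.0, 1.0, 3.0]
results = {
    "pipeline": run(pipeline, data),
    "substituted": run(substituted, data),
    "augmented": run(augmented, data),
    "excluded": run(excluded, data),
}
```

Each variant is built from the same steps, so changing the design never requires editing the body of an existing step.
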
- Task: The data mining task the algorithm addresses, such as regression, classification, or clustering.
- Structure: The functional form of the model we fit to the data.
- Score Function: Used to judge the quality of the fitted models. We usually try to minimize or maximize this function.
- Search Method: The search heuristic or optimization method used to maximize or minimize the Score Function.
- Data Management Techniques: Data management is one of the most ignored aspects of Machine Learning algorithms. This is where the computational resources at hand come into the picture, because a massive dataset can change how the whole procedure has to be run.
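
The five components above can be sketched as separate, swappable pieces. The following is a minimal illustration under stated assumptions, not a reference implementation: every name (`Learner`, `linear`, `squared_error`, `gradient_descent`, `grid_search`, `in_memory`) is hypothetical, the structure is a straight line, and the gradient is approximated numerically to keep the sketch short.

```python
from dataclasses import dataclass, replace
from itertools import product
from typing import Callable, Iterable, List, Sequence, Tuple

Row = Tuple[float, float]  # one (x, y) observation
Params = Sequence[float]


def linear(params: Params, x: float) -> float:
    """Structure: a straight line y = a * x + b."""
    a, b = params
    return a * x + b


def squared_error(structure: Callable, params: Params, rows: Iterable[Row]) -> float:
    """Score function: mean squared error of the structure on the rows."""
    errors = [(structure(params, x) - y) ** 2 for x, y in rows]
    return sum(errors) / len(errors)


def gradient_descent(score, start, steps=2000, rate=0.05, eps=1e-6) -> List[float]:
    """Search method: gradient descent with forward-difference gradients."""
    params = list(start)
    for _ in range(steps):
        base = score(params)
        grad = []
        for i in range(len(params)):
            bumped = list(params)
            bumped[i] += eps
            grad.append((score(bumped) - base) / eps)
        params = [p - rate * g for p, g in zip(params, grad)]
    return params


def grid_search(score, start, low=-5.0, high=5.0, step=0.5) -> List[float]:
    """Alternative search method: exhaustive search over a coarse grid."""
    count = int(round((high - low) / step)) + 1
    axis = [low + i * step for i in range(count)]
    return list(min(product(axis, repeat=len(start)), key=score))


def in_memory(rows: List[Row]) -> Callable[[], Iterable[Row]]:
    """Data management: serve every pass over the data from a list in memory."""
    return lambda: iter(rows)


@dataclass(frozen=True)
class Learner:
    task: str
    structure: Callable
    score: Callable
    search: Callable
    data: Callable[[], Iterable[Row]]

    def fit(self, start: Params) -> List[float]:
        # The search method only ever sees a function from parameters to a score.
        objective = lambda p: self.score(self.structure, p, self.data())
        return self.search(objective, start)


rows = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated from y = 2x + 1
learner = Learner("regression", linear, squared_error, gradient_descent, in_memory(rows))
fitted = learner.fit([0.0, 0.0])

# Substitution: swap only the search method and keep everything else.
swapped = replace(learner, search=grid_search)
fitted_by_grid = swapped.fit([0.0, 0.0])
```

Because the search method receives nothing but an objective function, it can be replaced without touching the structure, the score function, or the way the data is served. A data source that streams rows from disk could be substituted for `in_memory` in the same way.
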
Name | Task | Structure | Score Function | Search Method | Data Management Techniques |
---|---|---|---|---|---|
CART | Classification and Regression | Decision Tree | Cross-Validated Loss Function | Greedy Search Over Structures | Unspecified |
Backpropagation | Regression | Neural Network | Squared Error | Gradient Descent On Parameters | Unspecified |
A Priori | Rule Pattern Discovery | Association Rules | Support/Accuracy | Breadth-first Search | Linear Scans |
Vector Space for Information Retrieval | Retrieval of Similar Documents | Vector of Term Occurrences | Angle Between Two Vectors | Various Techniques | Fast Indexing Techniques |
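
As an illustration of the last row, the sketch below represents each document as a vector of term occurrences and scores it by the cosine of the angle between its vector and the query vector. It rests on a few assumptions: the function names (`term_vector`, `cosine_similarity`, `rank`) are hypothetical, tokenization is a plain whitespace split, and the search is a linear scan over all documents rather than a fast index.

```python
import math
from collections import Counter
from typing import List, Tuple


def term_vector(text: str) -> Counter:
    """Structure: a sparse vector of term occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Score function: cosine of the angle between two term vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, documents: List[str]) -> List[Tuple[int, float]]:
    """Search method: score every document and sort, most similar first."""
    q = term_vector(query)
    scored = [(i, cosine_similarity(q, term_vector(d))) for i, d in enumerate(documents)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


documents = [
    "decision tree for classification",
    "gradient descent for neural network",
    "association rules",
]
ranking = rank("neural network gradient", documents)
```

A larger cosine means a smaller angle, so the document at the front of `ranking` is the one most similar to the query.
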