Have variable license sections in license rules

Organization: AboutCode

Mentors:

Overview

This project enhances the detection_log so that it clearly indicates when extra-words are detected. These extra-words correspond to the variable parts of license rules, and they previously caused the match score to fall below 100.

To address this, the implementation now verifies whether the extra-words appear in the correct position within the license text. If they do, the match score is raised to 100, resulting in more accurate license rule matching.

Implementation

Enhanced the detection_log:
- Display extra-words when they are detected.
Added an extra-phrase marker, [[n]], for extra-words:
- An extra-phrase is written as double opening square brackets [[, a number n, and double closing square brackets ]].
- The extra-phrase [[n]] is inserted in a license rule at each position where extra-words may appear.
- The value of n is the maximum number of extra-words permitted at that position.
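The marker syntax above can be illustrated with a minimal sketch. This is not the ScanCode implementation: the function name split_rule, the regular expressions, the simplified word tokenizer, and the sample rule text are all assumptions made for illustration. It separates the fixed words of a rule from its [[n]] markers and records, for each marker, the position it applies to and its limit.

```python
import re

# Hypothetical pattern for an extra-phrase marker such as [[3]].
EXTRA_PHRASE = re.compile(r"\[\[(\d+)\]\]")
# Simplified word tokenizer, for illustration only.
WORD = re.compile(r"[A-Za-z0-9]+")


def split_rule(rule_text):
    """
    Return (tokens, allowed) for a rule text containing [[n]] markers.

    tokens is the list of fixed rule words, lowercased.
    allowed maps a token index to the maximum number of extra-words
    permitted immediately before the token at that index.
    """
    tokens = []
    allowed = {}
    pos = 0
    for marker in EXTRA_PHRASE.finditer(rule_text):
        # Collect the fixed words that precede this marker.
        tokens.extend(w.lower() for w in WORD.findall(rule_text[pos:marker.start()]))
        # The marker applies at the position of the next fixed word.
        allowed[len(tokens)] = int(marker.group(1))
        pos = marker.end()
    # Collect the fixed words after the last marker.
    tokens.extend(w.lower() for w in WORD.findall(rule_text[pos:]))
    return tokens, allowed


# Made-up rule text, not an actual ScanCode rule.
tokens, allowed = split_rule("Copyright by [[3]] and released under the MIT license")
print(tokens)   # ['copyright', 'by', 'and', 'released', 'under', 'the', 'mit', 'license']
print(allowed)  # {2: 3}
```

Here the marker [[3]] sits before the third fixed word, so up to three extra-words are permitted at index 2. A marker at the very end of a rule maps to an index equal to the number of tokens.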
Improved the score:
- Check whether the extra-words appear at a position marked by an extra-phrase and do not exceed the maximum number allowed there.
- If both conditions are satisfied, increase the match score to 100.
Reported the outcome in the detection_log:
- If the score is increased, meaning the extra-words are in the correct position and within the limit, extra-words-permitted-in-rule is shown in the detection_log.
- If the extra-words are in the wrong position or exceed the maximum allowable limit, extra-words is shown in the detection_log.
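The scoring and logging rules above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual ScanCode code: the function names, the greedy word-by-word alignment, and the sample data are hypothetical, and the sketch assumes extra-words have already been detected in the match. Only the two log entry names come from the project itself.

```python
def extra_words_are_permitted(rule_tokens, allowed, query_tokens):
    """
    Return True if query_tokens consists of rule_tokens in order, plus
    extra-words that occur only at permitted positions and within limits.

    allowed maps a rule token index to the maximum number of extra-words
    permitted immediately before that token (hypothetical representation).
    The alignment is greedy: a query word equal to the next rule token is
    always treated as that token.
    """
    i = 0           # index of the next expected rule token
    extra_here = 0  # extra-words seen so far before rule_tokens[i]
    for word in query_tokens:
        if i < len(rule_tokens) and word == rule_tokens[i]:
            i += 1
            extra_here = 0
        else:
            extra_here += 1
            if extra_here > allowed.get(i, 0):
                # Wrong position, or more extra-words than the marker allows.
                return False
    # Every rule token must have been matched.
    return i == len(rule_tokens)


def adjust_match(score, rule_tokens, allowed, query_tokens):
    """Return (new_score, detection_log_entry) for a match with extra-words."""
    if extra_words_are_permitted(rule_tokens, allowed, query_tokens):
        return 100.0, "extra-words-permitted-in-rule"
    return score, "extra-words"


# Made-up rule: up to 3 extra-words are permitted before the token at index 2.
rule_tokens = ["copyright", "by", "and", "released", "under", "the", "mit", "license"]
allowed = {2: 3}

ok = "copyright by jane doe and released under the mit license".split()
misplaced = "copyright by and released under the new mit license".split()

print(adjust_match(85.0, rule_tokens, allowed, ok))         # (100.0, 'extra-words-permitted-in-rule')
print(adjust_match(85.0, rule_tokens, allowed, misplaced))  # (85.0, 'extra-words')
```

The greedy alignment keeps the sketch short but is a simplification: an extra-word that happens to equal the next rule token would be consumed as that token.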
Testing:
- Added tests for the extra-phrase functionality, such as test_extra_phrase_tokenizer and test_extra_phrase_spans, to ensure that phrases are correctly identified and processed.
- Implemented multiple tests to verify that extra-words appear in the correct position according to the rules and that the match score is updated correctly when they are within the allowable limit.
- Covered various edge cases where extra-words might be misplaced or exceed the maximum allowable count, ensuring the scoring and logging behave as expected.

Linked Pull Requests

Sr. no	Name	Link	Status
1	Display extra-words in detection_log if present	aboutcode-org/scancode-toolkit#4402	Merged
2	Improve score by supporting extra_phrase for extra-words in rules	aboutcode-org/scancode-toolkit#4432	Open
3	Add extra-phrase in rules	aboutcode-org/scancode-toolkit#4518	Open

Pre GSoC Work

Before GSoC, I contributed the following PRs:

Sr. no	Name	Link
1	Renaming the dependency attribute is_resolved to is_pinned	aboutcode-org/scancode-workbench#638
2	Add test for all PyPI METADATA versions	aboutcode-org/scancode-toolkit#4180
3	Add test for false positive GPL3 license	aboutcode-org/scancode-toolkit#4106
4	Add new rules for EUPL license	aboutcode-org/scancode-toolkit#4204
5	Add DUMB License and detection rule	aboutcode-org/scancode-toolkit#4400
6	Fixing the dead link by cross-reference in the documentation	aboutcode-org/purldb#550
7	Add test for equivalent word	aboutcode-org/scancode-toolkit#4305
8	Enhance code visibility in dark mode	aboutcode-org/scancode-workbench#637

Post GSoC

I plan to continue contributing by adding extra-phrase support across many license rules. This will strengthen license detection by making it more accurate and flexible in handling variations within the rules.

To identify named entities in rules, I created a new repository, named-entity-utils, which I am currently working on. This utility adds extra-phrase markers in rules at the positions where named entities are present.

Links

Acknowledgements

I would like to thank my mentors:

Special thanks to my mentors, who supported me throughout this journey. Whenever I faced a problem, we discussed it in depth during our weekly status calls. Without their guidance and constant help, completing this project would not have been possible.

I also plan to explore more projects in AboutCode and contribute whenever I get time, because I would love to remain a part of this wonderful organization.