Have variable license sections in license rules
Organization: AboutCode
Projects: Scancode Toolkit
Mentee: Alok Kumar (alok1304)
Mentors:
Overview
This project enhances the detection_log so that it clearly indicates when extra-words are detected. Extra-words correspond to the variable parts of a license rule, and they previously caused the match score to fall below 100.
To address this, the implementation now verifies whether the extra-words appear at the expected positions within the license text. If they do, the score is raised accordingly, which makes license rule matching more accurate.
Implementation
Enhanced the detection_log:
Display extra-words when they are detected.
Added extra-phrase marker like [[n]] for the extra-words:
The extra-phrase is denoted by double opening square brackets [[ and double closing square brackets ]]. Here, n represents the maximum number of allowable extra-words.
The extra-phrase [[n]] is inserted in license rules at positions where extra-words may appear. The value of n specifies how many extra-words are permitted at that location.
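The marker described above can be illustrated with a minimal sketch. It is not ScanCode's actual tokenizer: the names EXTRA_PHRASE and parse_rule are hypothetical, and the sketch assumes that each marker is separated from the surrounding words by whitespace.

```python
import re

# Hypothetical pattern for an extra-phrase marker such as [[3]].
EXTRA_PHRASE = re.compile(r"\[\[(\d+)\]\]")


def parse_rule(rule_text):
    """
    Return (words, allowed): `words` is the list of rule words with the
    markers removed, and `allowed` maps a word position to the maximum
    number of extra-words permitted just before that position.
    """
    words = []
    allowed = {}
    for token in rule_text.split():
        match = EXTRA_PHRASE.fullmatch(token)
        if match:
            # The marker is not a rule word: record its limit at the
            # position of the next rule word.
            allowed[len(words)] = int(match.group(1))
        else:
            words.append(token.lower())
    return words, allowed


words, allowed = parse_rule("Copyright by [[3]] all rights reserved")
print(words)    # ['copyright', 'by', 'all', 'rights', 'reserved']
print(allowed)  # {2: 3}
```

In this example, up to three extra-words (for instance a holder name) may appear between "by" and "all".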
Improve Score:
Check whether extra-words appear in the correct position as defined by the extra-phrase, and ensure they do not exceed the maximum allowable limit.
If the conditions are satisfied, increase the match score to 100.
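The check above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from the pull request: extra_words_permitted and adjusted_score are hypothetical names, the rule is given as a plain word list plus a mapping from word position to the limit n, and the alignment is greedy, so it may misjudge a text whose extra-words repeat the next rule word.

```python
def extra_words_permitted(rule_words, allowed, text_words):
    """
    Return True if `text_words` equals `rule_words` except for extra-words
    that sit only at marked positions and stay within each position's limit.
    `allowed` maps a rule word position to its maximum extra-word count.
    """
    pos = 0  # current index in text_words
    for rule_pos, rule_word in enumerate(rule_words):
        extra = 0
        # Count text words skipped before the next rule word is found.
        while pos < len(text_words) and text_words[pos] != rule_word:
            extra += 1
            pos += 1
        if pos == len(text_words):
            return False  # the rule word never appears
        if extra > allowed.get(rule_pos, 0):
            return False  # extra-words misplaced or over the limit
        pos += 1
    # Words left after the last rule word need a trailing marker.
    return len(text_words) - pos <= allowed.get(len(rule_words), 0)


def adjusted_score(score, rule_words, allowed, text_words):
    """Raise the score to 100 only when every extra-word is permitted."""
    if extra_words_permitted(rule_words, allowed, text_words):
        return 100.0
    return score


rule = ["copyright", "by", "all", "rights", "reserved"]
limits = {2: 3}  # up to 3 extra-words allowed before "all"
text = "copyright by acme corp all rights reserved".split()
print(adjusted_score(71.4, rule, limits, text))  # 100.0
```

The starting score of 71.4 is an arbitrary illustrative value, not a score computed by ScanCode.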
Shows in detection_log:
If the score is increased, meaning the extra-words are in the correct position, then show extra-words-permitted-in-rule in the detection_log.
If the extra-words are in the wrong place or exceed the maximum allowable limit, then show extra-words in the detection_log.
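The choice between the two log entries can be summarized in a small sketch. The function name detection_log_entry and its boolean parameters are hypothetical; only the two entry strings come from the description above.

```python
def detection_log_entry(has_extra_words, permitted):
    """
    Pick the detection_log entry for a match.
    `has_extra_words`: the match contains extra-words.
    `permitted`: they sit at marked positions and within the limit.
    """
    if not has_extra_words:
        return None  # nothing to report about extra-words
    if permitted:
        return "extra-words-permitted-in-rule"
    return "extra-words"


print(detection_log_entry(True, True))   # extra-words-permitted-in-rule
print(detection_log_entry(True, False))  # extra-words
```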
Testing:
Added tests for the extra-phrase functionality, such as test_extra_phrase_tokenizer and test_extra_phrase_spans, to ensure that phrases are correctly identified and processed.
Implemented multiple tests to verify that extra-words appear in the correct position according to the rules and that the match score is updated correctly when they are within the allowable limit.
Covered various edge cases where extra-words might be misplaced or exceed the maximum allowable count, ensuring the scoring and logging behave as expected.
Linked Pull Requests
Sr. no | Name | Link | Status |
---|---|---|---|
1 | Display extra-words in detection_log if present | | Merged |
2 | Improve score by supporting extra_phrase for extra-words in rules | | Open |
3 | Add extra-phrase in rules | | Open |
Pre GSoC Work
Before GSoC, I contributed the following PRs:
Sr. no | Name | Link |
---|---|---|
1 | Renaming the dependency attribute is_resolved to is_pinned | |
2 | Add test for all PyPI METADATA versions | |
3 | Add test for false positive GPL3 license | |
4 | Add new rules for EUPL license | |
5 | Add DUMB License and detection rule | |
6 | Fixing the dead link by cross-reference in the documentation | |
7 | Add test for equivalent word | |
8 | Enhance code visibility in dark mode | |
Post GSoC
I plan to continue contributing by adding extra-phrase support across many license rules. This will strengthen license detection by making it more accurate and flexible in handling variations within the rules.
To identify named entities in rules, I created a new repository, named-entity-utils, which I am currently working on. This utility adds extra-phrase markers to rules at positions where named entities are present.
Links
Acknowledgements
I would like to thank my mentors:
A special thanks to my mentors who always supported me throughout this journey. Whenever I faced a problem, we discussed it in depth during our weekly status calls. Without their guidance and constant help, completing this project would not have been possible.
I also plan to explore more projects in AboutCode and contribute whenever I get time, because I would love to remain a part of this wonderful organization.