Have variable license sections in license rules

Organization: AboutCode

Project: ScanCode Toolkit

Mentee: Alok Kumar (alok1304)

Mentors:

Overview

This project enhances the detection_log so that it clearly indicates when extra-words are detected. Extra-words correspond to the variable parts of a license rule, and previously they caused the match score to fall below 100.

To address this, the implementation now checks whether the extra-words appear at the expected position in the license text. If they do, the match score is raised, which gives more accurate license rule matching.


Implementation

  • Enhanced the detection_log:

    • Display extra-words when they are detected.

  • Added an extra-phrase marker, [[n]], for the extra-words:

    • The marker is written with double opening square brackets [[ and double closing square brackets ]].

    • The extra-phrase [[n]] is inserted in license rules at positions where extra-words may appear.

    • The value of n is the maximum number of extra-words permitted at that location.

  • Improved the score:

    • Check whether extra-words appear in the correct position as defined by the extra-phrase, and ensure they do not exceed the maximum allowable limit.

    • If the conditions are satisfied, increase the match score to 100.

  • Reported the outcome in the detection_log:

    • If the extra-words are in the correct position and within the limit, so that the score is increased, show extra-words-permitted-in-rule in the detection_log.

    • If the extra-words are in the wrong place or exceed the maximum allowable limit, show extra-words in the detection_log.

  • Testing:

    • Added tests for the extra-phrase functionality, such as test_extra_phrase_tokenizer and test_extra_phrase_spans, to ensure that phrases are correctly identified and processed.

    • Implemented multiple tests to verify that extra-words appear in the correct position according to the rules and that the match score is updated correctly when they are within the allowable limit.

    • Covered various edge cases where extra-words might be misplaced or exceed the maximum allowable count, ensuring the scoring and logging behave as expected.
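The steps listed above can be illustrated with a small, self-contained sketch. It is not the ScanCode Toolkit implementation: the function names (parse_rule, extra_words_by_position, score_match), the whitespace tokenization, the greedy alignment, and the reduced-score formula are all assumptions made for illustration. Only the [[n]] marker syntax and the two detection_log entries come from the description above.

```python
import re

# Marker syntax from the report: [[n]], where n is the maximum number of
# extra-words allowed at that position. The regex itself is an assumption.
EXTRA_PHRASE = re.compile(r"\[\[(\d+)\]\]")


def parse_rule(rule_text):
    """Split a rule into lowercase words and a map of allowed extra-words.

    The map goes from an index in the word list to the maximum number of
    extra-words permitted just before the word at that index.
    """
    words = []
    allowed = {}
    for token in rule_text.split():
        marker = EXTRA_PHRASE.fullmatch(token)
        if marker:
            allowed[len(words)] = int(marker.group(1))
        else:
            words.append(token.lower())
    return words, allowed


def extra_words_by_position(rule_words, query_words):
    """Greedily align query words to rule words (hypothetical helper).

    Returns a map from rule word index to the number of extra query words
    found just before it, or None if a rule word is missing from the query.
    """
    extras = {}
    qi = 0
    for pos, word in enumerate(rule_words):
        count = 0
        while qi < len(query_words) and query_words[qi] != word:
            count += 1
            qi += 1
        if qi == len(query_words):
            return None
        if count:
            extras[pos] = count
        qi += 1
    trailing = len(query_words) - qi
    if trailing:
        extras[len(rule_words)] = trailing
    return extras


def score_match(rule_text, query_text):
    """Return (score, detection_log) for a query matched against a rule."""
    rule_words, allowed = parse_rule(rule_text)
    extras = extra_words_by_position(rule_words, query_text.lower().split())
    if extras is None:
        return 0.0, []
    if not extras:
        return 100.0, []
    # Every run of extra-words must sit at a marker and respect its limit.
    if all(count <= allowed.get(pos, 0) for pos, count in extras.items()):
        return 100.0, ["extra-words-permitted-in-rule"]
    # Assumed stand-in for a reduced score: share of matched words.
    matched = len(rule_words)
    score = round(100 * matched / (matched + sum(extras.values())), 2)
    return score, ["extra-words"]


rule = "Redistribution and use of [[3]] in source and binary forms"
print(score_match(rule, "Redistribution and use of the Foo software in source and binary forms"))
print(score_match(rule, "Redistribution and use of the Foo Bar software in source and binary forms"))
print(score_match(rule, "Redistribution and freely use of in source and binary forms"))
```

In this sketch the first query stays within the three extra-words allowed at the marker, so it is reported as extra-words-permitted-in-rule with a score of 100. The second exceeds the limit and the third has an extra word where no marker exists, so both are reported as extra-words with a score below 100.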


Linked Pull Requests

| Sr. no | Name | Link | Status |
|---|---|---|---|
| 1 | Display extra-words in detection_log if present | aboutcode.org/scancode-toolkit#4402 | Merged |
| 2 | Improve score by supporting extra_phrase for extra-words in rules | aboutcode.org/scancode-toolkit#4432 | Open |
| 3 | Add extra-phrase in rules | aboutcode.org/scancode-toolkit#4518 | Open |

Pre GSoC Work

Before GSoC, I contributed the following PRs:

| Sr. no | Name | Link |
|---|---|---|
| 1 | Renaming the dependency attribute is_resolved to is_pinned | aboutcode-org/scancode-workbench#638 |
| 2 | Add test for all PyPI METADATA versions | aboutcode-org/scancode-toolkit#4180 |
| 3 | Add test for false positive GPL3 license | aboutcode-org/scancode-toolkit#4106 |
| 4 | Add new rules for EUPL license | aboutcode-org/scancode-toolkit#4204 |
| 5 | Add DUMB License and detection rule | aboutcode-org/scancode-toolkit#4400 |
| 6 | Fixing the dead link by cross-reference in the documentation | aboutcode-org/purldb#550 |
| 7 | Add test for equivalent word | aboutcode-org/scancode-toolkit#4305 |
| 8 | Enhance code visibility in dark mode | aboutcode-org/scancode-workbench#637 |

Post GSoC

I plan to continue contributing by adding extra-phrase support across many license rules. This will strengthen license detection by making it more accurate and flexible in handling variations within the rules.

To identify named entities in rules, I created a new repository, named-entity-utils, which I am currently working on. This utility is used to add extra-phrase markers in rules at positions where named entities are present.

Acknowledgements

I would like to thank my mentors:

Special thanks to my mentors, who supported me throughout this journey. Whenever I faced a problem, we discussed it in depth during our weekly status calls. Without their guidance and constant help, completing this project would not have been possible.

I also plan to explore more projects in AboutCode and contribute whenever I get time, because I would love to remain a part of this wonderful organization.