Ensure that loc = sloc + multi + single-line comments + blank. #113
Conversation
@zukoj @rubik Any thoughts on the approach above?
I like this definition of LOC; it's very practical. I'll investigate the tokenize issue a bit. I don't even understand why that would be valid Python syntax: a line continuation to nothing? In any case, if Python can parse it, we have to be able to parse it too.
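For context, the snippet under discussion can be reconstructed from the `line` fields of the token dump below and fed straight back to `tokenize`. This is a minimal sketch; the exact `source` string is a reconstruction from that dump, not the original test file:

```python
import io
import tokenize

# Reconstructed from the `line` fields in the dump below: a docstring
# followed by a backslash continuation onto a blank line, then `pass`.
source = 'def func():\n    """ my comment """ \\\n\n    pass\n'

# tokenize.tokenize() takes a bytes readline and emits the ENCODING
# token that appears at the top of both dumps below.
for tok in tokenize.tokenize(io.BytesIO(source.encode('utf-8')).readline):
    print(tok)
```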
The tokenizer seems to be working correctly. This is the result for the code above:

```
[TokenInfo(type=59 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line=''),
TokenInfo(type=1 (NAME), string='def', start=(1, 0), end=(1, 3), line='def func():\n'),
TokenInfo(type=1 (NAME), string='func', start=(1, 4), end=(1, 8), line='def func():\n'),
TokenInfo(type=53 (OP), string='(', start=(1, 8), end=(1, 9), line='def func():\n'),
TokenInfo(type=53 (OP), string=')', start=(1, 9), end=(1, 10), line='def func():\n'),
TokenInfo(type=53 (OP), string=':', start=(1, 10), end=(1, 11), line='def func():\n'),
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 11), end=(1, 12), line='def func():\n'),
TokenInfo(type=5 (INDENT), string=' ', start=(2, 0), end=(2, 4), line=' """ my comment """ \\\n'),
TokenInfo(type=3 (STRING), string='""" my comment """', start=(2, 4), end=(2, 22), line=' """ my comment """ \\\n'),
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 0), end=(3, 1), line='\n'),
TokenInfo(type=1 (NAME), string='pass', start=(4, 4), end=(4, 8), line=' pass\n'),
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 8), end=(4, 9), line=' pass\n'),
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line=''),
TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')]
```

Whereas, without the backslash:

```
[TokenInfo(type=59 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line=''),
TokenInfo(type=1 (NAME), string='def', start=(1, 0), end=(1, 3), line='def func():\n'),
TokenInfo(type=1 (NAME), string='func', start=(1, 4), end=(1, 8), line='def func():\n'),
TokenInfo(type=53 (OP), string='(', start=(1, 8), end=(1, 9), line='def func():\n'),
TokenInfo(type=53 (OP), string=')', start=(1, 9), end=(1, 10), line='def func():\n'),
TokenInfo(type=53 (OP), string=':', start=(1, 10), end=(1, 11), line='def func():\n'),
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 11), end=(1, 12), line='def func():\n'),
TokenInfo(type=5 (INDENT), string=' ', start=(2, 0), end=(2, 4), line=' """ my comment """\n'),
TokenInfo(type=3 (STRING), string='""" my comment """', start=(2, 4), end=(2, 22), line=' """ my comment """\n'),
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 22), end=(2, 23), line=' """ my comment """\n'),
TokenInfo(type=58 (NL), string='\n', start=(3, 0), end=(3, 1), line='\n'),
TokenInfo(type=1 (NAME), string='pass', start=(4, 4), end=(4, 8), line=' pass\n'),
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 8), end=(4, 9), line=' pass\n'),
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line=''),
TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')]
```

The only difference is the presence of the NL token (58) in the second case.
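If it helps to see that difference programmatically, here is one way to list just the newline-ish tokens. This is a small sketch, and `newline_kinds` is a made-up helper name:

```python
import io
import tokenize

def newline_kinds(source):
    # NEWLINE ends a logical line; NL marks a non-logical line break
    # (blank lines and the like). In the backslash case above, the NL
    # for the blank line never shows up.
    toks = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tok.start[0], tokenize.tok_name[tok.type])
            for tok in toks
            if tok.type in (tokenize.NEWLINE, tokenize.NL)]

# e.g. newline_kinds('x = 1\n\n') -> [(1, 'NEWLINE'), (2, 'NL')]
```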
The parsing issue is just with how the line counting algorithm works; it doesn't cause the tool to fail. If I run the tokenizer on the file above, it reports an error: it needs something else to complete successfully (like the `pass` statement). The funny thing is that the code above runs fine in Python. The algorithm for finding multiline strings is to keep adding lines to a buffer until the tokenizer is satisfied that it has a completely "tokenizable" buffer. However, the issue above means that it may need to consume extra lines into the multi-line string. The multiline string token still has its correct start and end positions, but the next row has now been consumed and is no longer considered source. The result is that the test case reports only 1 sloc and 1 lloc when there should be 2 of each: it considers the whole docstring to be a "single-line comment", which includes the `pass` line.
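For readers following along, the buffer-growing approach described above can be sketched roughly as follows. This is only an illustration of the idea, not radon's actual implementation, and `logical_chunks` is a hypothetical name:

```python
import tokenize

def logical_chunks(lines):
    # Keep appending physical lines to a buffer until the whole buffer
    # tokenizes cleanly, then emit it as one "tokenizable" chunk.
    buf = []
    for line in lines:
        buf.append(line)
        try:
            # generate_tokens() treats StopIteration from readline as EOF.
            list(tokenize.generate_tokens(iter(buf).__next__))
        except (tokenize.TokenError, IndentationError):
            # An open multi-line string or a trailing backslash leaves the
            # buffer incomplete, so the next physical line gets consumed
            # too -- which is exactly how the docstring above swallows the
            # following row.
            continue
        yield buf
        buf = []
    if buf:
        yield buf
```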
Weird, I cannot reproduce the issue. Here's what I get:

I find it extremely odd that your output is different from mine!
Since this is at worst a bug in the tokenize module, I'm merging this one. This way we can have sane definitions of LOC, SLOC, etc. again.
I was forced to revert this because it didn't pass the tests; I should have checked before merging here. It gives weird results with oddly high numbers. For example, for

```
def function():
    """ docstring continued by blank line is still a single-line comment """
    pass
```

it seems like some lines are counted twice or more.
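For anyone reproducing this, the counts can be inspected through radon's raw-metrics API. A quick sketch, assuming `radon.raw.analyze` and its `Module` fields (`loc`, `sloc`, `multi`, `single_comments`, `blank`) behave as in current radon:

```python
from radon.raw import analyze

source = (
    'def function():\n'
    '    """ docstring continued by blank line is still a single-line comment """\n'
    '    pass\n'
)

m = analyze(source)
print(m)
# The identity this PR is meant to enforce:
assert m.loc == m.sloc + m.multi + m.single_comments + m.blank
```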
This PR tries to address #109. It fixes loc to genuinely be the sum of sloc, multi-line comments (including blank lines inside them), single-line comments, and blank lines (outside of comments). This requires a lot of test changes.
There is a TODO on one case where the tokenizer approach seemed to be getting tripped up: the tokenizer reports an error until it joins the line containing `pass`. Not sure if this is a bug with the tokenizer or not. I'm also not sure what the intermediate values in the comments should be for the `second_mi` example.