
Ensure that loc = sloc + multi + single-line comments + blank. #113

Merged: 1 commit merged into rubik:master on May 10, 2017

Conversation

@ashanbrown (Contributor) commented Apr 6, 2017

This PR tries to address #109. It fixes loc so that it is genuinely the sum of sloc, multi-line comments (including blank lines inside them), single-line comments, and blank lines (outside of comments). This requires a lot of test changes.
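
A minimal sketch of the identity this change aims for, assuming radon's raw.analyze interface and its single_comments field (illustration only, not part of the PR):

# Illustration: the decomposition this PR is meant to enforce.
from radon.raw import analyze

source = '''def func():
    """ docstring counted as a single-line comment """

    pass
'''

m = analyze(source)
print(m)
# Goal: total lines decompose exactly into the four buckets.
print(m.loc, '==', m.sloc + m.multi + m.single_comments + m.blank)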

There is a TODO on one case where the tokenizer approach seemed to be getting tripped up:

def func():
    """ my comment """ \
   
    pass

In this case, the tokenizer reports an error until the line containing pass is joined in. I'm not sure whether this is a bug in the tokenizer or not.

I'm also not sure what the intermediate values in the comments should be for the second_mi example.

@coveralls commented Apr 6, 2017

Coverage Status

Coverage increased (+0.009%) to 98.433% when pulling 904ca9f on ashanbrown:fix-loc into 74c01b7 on rubik:master.

1 similar comment from @coveralls.

@ashanbrown commented

@zukoj @rubik Any thoughts on the approach above?

@rubik commented Apr 10, 2017

I like this definition of LOC; it's very practical. I'll investigate the tokenize issue a bit. I don't even understand why that would be valid Python syntax: a line continuation to nothing? In any case, if Python can parse it, we have to be able to parse it too.
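
One quick sanity check suggests CPython does accept the construct: the backslash splices the whitespace-only line onto the docstring line, so the logical line is just the string literal plus trailing whitespace. For example:

# Illustrative check only: compile() accepts the spliced source.
src = 'def func():\n    """ my comment """ \\\n   \n    pass\n'
compile(src, '<example>', 'exec')  # no SyntaxError is raised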

@rubik mentioned this pull request on Apr 10, 2017
@rubik commented Apr 10, 2017

The tokenizer seems to be working correctly. This is the result for the code above:

[TokenInfo(type=59 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line=''),
 TokenInfo(type=1 (NAME), string='def', start=(1, 0), end=(1, 3), line='def func():\n'),
 TokenInfo(type=1 (NAME), string='func', start=(1, 4), end=(1, 8), line='def func():\n'),
 TokenInfo(type=53 (OP), string='(', start=(1, 8), end=(1, 9), line='def func():\n'),
 TokenInfo(type=53 (OP), string=')', start=(1, 9), end=(1, 10), line='def func():\n'),
 TokenInfo(type=53 (OP), string=':', start=(1, 10), end=(1, 11), line='def func():\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 11), end=(1, 12), line='def func():\n'),
 TokenInfo(type=5 (INDENT), string='    ', start=(2, 0), end=(2, 4), line='    """ my comment """ \\\n'),
 TokenInfo(type=3 (STRING), string='""" my comment """', start=(2, 4), end=(2, 22), line='    """ my comment """ \\\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 0), end=(3, 1), line='\n'),
 TokenInfo(type=1 (NAME), string='pass', start=(4, 4), end=(4, 8), line='    pass\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 8), end=(4, 9), line='    pass\n'),
 TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')]

Whereas, without the backslash:

[TokenInfo(type=59 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line=''),
 TokenInfo(type=1 (NAME), string='def', start=(1, 0), end=(1, 3), line='def func():\n'),
 TokenInfo(type=1 (NAME), string='func', start=(1, 4), end=(1, 8), line='def func():\n'),
 TokenInfo(type=53 (OP), string='(', start=(1, 8), end=(1, 9), line='def func():\n'),
 TokenInfo(type=53 (OP), string=')', start=(1, 9), end=(1, 10), line='def func():\n'),
 TokenInfo(type=53 (OP), string=':', start=(1, 10), end=(1, 11), line='def func():\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 11), end=(1, 12), line='def func():\n'),
 TokenInfo(type=5 (INDENT), string='    ', start=(2, 0), end=(2, 4), line='    """ my comment """\n'),
 TokenInfo(type=3 (STRING), string='""" my comment """', start=(2, 4), end=(2, 22), line='    """ my comment """\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 22), end=(2, 23), line='    """ my comment """\n'),
 TokenInfo(type=58 (NL), string='\n', start=(3, 0), end=(3, 1), line='\n'),
 TokenInfo(type=1 (NAME), string='pass', start=(4, 4), end=(4, 8), line='    pass\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 8), end=(4, 9), line='    pass\n'),
 TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line=''),
 TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')]

The only difference is the presence of the NL token (58) in the second case.
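
For reference, a dump in that format can be produced by tokenizing the snippet from an in-memory bytes buffer (Python 3); the source string here is only an illustration of the example above:

import io
import tokenize
from pprint import pprint

# The example function with the trailing backslash, tokenized from memory.
source = b'def func():\n    """ my comment """ \\\n\n    pass\n'
pprint(list(tokenize.tokenize(io.BytesIO(source).readline)))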

@ashanbrown commented Apr 10, 2017

The parsing issue is just with how the line-counting algorithm works; it doesn't cause the tool to fail.

def func():
    """ my comment """ \
    """ continued """

If I run the tokenizer on the file above I get:

1,0-1,3:	NAME	'def'
1,4-1,8:	NAME	'func'
1,8-1,9:	OP	'('
1,9-1,10:	OP	')'
1,10-1,11:	OP	':'
1,11-1,12:	NEWLINE	'\n'
1,16-1,34:	STRING	'""" my comment """'
1,41-1,58:	STRING	'""" continued """'
1,58-1,59:	NEWLINE	'\n'
1,59-1,60:	NEWLINE	'\n'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/tokenize.py", line 170, in tokenize
    tokenize_loop(readline, tokeneater)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/tokenize.py", line 176, in tokenize_loop
    for token_info in generate_tokens(readline):
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/tokenize.py", line 380, in generate_tokens
    raise TokenError, ("EOF in multi-line statement", (lnum, 0))
tokenize.TokenError: ('EOF in multi-line statement', (2, 0))

It needs something else (like the pass statement) to complete successfully. The funny thing is that the code above runs fine in Python. The algorithm for finding multi-line strings is to keep adding lines to a buffer until the tokenizer is satisfied that it has a completely "tokenizable" buffer (see the sketch below). The issue above, however, means that it may need to consume extra lines into the multi-line string. The multi-line string token still has its correct start and end positions, but the next row has now been consumed and is no longer considered source. The result is that this test case:

      def function():
          """ docstring continued by blank line is still a single-line comment """ \
  
          pass

reports only 1 sloc and 1 lloc when there should be 2 of each. It treats the whole docstring as a "single-line comment" that swallows the pass statement.
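
The buffering idea described above amounts to roughly the following (a paraphrased sketch, not radon's actual implementation):

import tokenize
from io import StringIO

def logical_chunks(lines):
    """Yield groups of physical lines that tokenize cleanly on their own."""
    buffer = []
    for line in lines:
        buffer.append(line)
        try:
            # Does the buffer tokenize as a complete chunk yet?
            list(tokenize.generate_tokens(StringIO(''.join(buffer)).readline))
        except tokenize.TokenError:
            # Not yet: an open multi-line string, open bracket, or a trailing
            # backslash continuation. Keep consuming lines.
            continue
        yield ''.join(buffer)
        buffer = []
    if buffer:
        yield ''.join(buffer)

With the dangling backslash, the docstring's chunk cannot close on its own line; at least one following line has to be pulled into the buffer before it tokenizes, which is how extra lines end up counted as part of the "single-line comment".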

@rubik commented Apr 13, 2017

Weird, I cannot reproduce the issue. Here's what I get:

$ cat a.py
def func():
    """ my comment """ \
    """ continued """
$ python2.7 -m tokenize a.py
1,0-1,3:	NAME	'def'
1,4-1,8:	NAME	'func'
1,8-1,9:	OP	'('
1,9-1,10:	OP	')'
1,10-1,11:	OP	':'
1,11-1,12:	NEWLINE	'\n'
2,0-2,4:	INDENT	'    '
2,4-2,22:	STRING	'""" my comment """'
3,4-3,21:	STRING	'""" continued """'
3,21-3,22:	NEWLINE	'\n'
4,0-4,0:	DEDENT	''
4,0-4,0:	ENDMARKER	''
$ python3.6 -m tokenize a.py
0,0-0,0:            ENCODING       'utf-8'        
1,0-1,3:            NAME           'def'          
1,4-1,8:            NAME           'func'         
1,8-1,9:            OP             '('            
1,9-1,10:           OP             ')'            
1,10-1,11:          OP             ':'            
1,11-1,12:          NEWLINE        '\n'           
2,0-2,4:            INDENT         '    '         
2,4-2,22:           STRING         '""" my comment """'
3,4-3,21:           STRING         '""" continued """'
3,21-3,22:          NEWLINE        '\n'           
4,0-4,0:            DEDENT         ''             
4,0-4,0:            ENDMARKER      ''

I find it extremely odd that your output is different from mine!?

@rubik commented May 10, 2017

Since this is at worst a bug in the tokenize module, I'm merging this one. This way we can have sane definitions of LOC, SLOC, etc. again.

@rubik merged commit 302ecc6 into rubik:master on May 10, 2017
rubik added a commit that referenced this pull request May 10, 2017
rubik added a commit that referenced this pull request May 10, 2017
@rubik commented May 10, 2017

I was forced to revert this because it didn't pass the tests. I should have checked before merging here. It gives weird results with oddly high numbers. For example, for

def function():
    """ docstring continued by blank line is still a single-line comment """

    pass

it gives

ex.py
    LOC: 8
    LLOC: 2
    SLOC: 5
    Comments: 0
    Single comments: 1
    Multi: 0
    Blank: 2

It seems like some lines are counted twice or more.
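
For comparison, that snippet has only 4 physical lines, so under this PR's definition one would presumably expect LOC 4 = SLOC 2 (def and pass) + Multi 0 + Single comments 1 (the docstring) + Blank 1. The reported totals still satisfy the identity (8 = 5 + 0 + 1 + 2) but count several lines more than once.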
