Pseudo-terminal (pty) devices stay open/allocated after run() finishes #518

Closed
bitprophet opened this issue Apr 25, 2018 · 3 comments

@bitprophet
Member

This is circumstantial, but I had a funky issue today that seems possibly Invoke's fault:

  • Started getting Device not configured errors when trying to open new tmux windows or panes
  • Couldn't figure out what I'd been doing that would hold open too many file descriptors or terminals
  • Realized I had inv watch-docs running, and perhaps it was not closing out resources correctly. (It comes from invocations and uses watchdog to run the docs.build task on any file change; that task calls Sphinx via run(..., pty=True).)
  • Closed it and the issue immediately resolved, so that's suspicious
  • Question is then why it would be holding too many resources open? Shouldn't the forks created in Local via pty.fork disappear once we're done? Ditto the actual pseudo-terminals themselves?
  • Seems likely that we're missing some cleanup that needs to run and it's simply undetected due to so few single Invoke sessions spawning that many ptys
    • Could also be some issue more specific to either the invocations watch module, or watchdog itself, though this seems less likely
    • Certainly a workaround for this specific issue would be to call Sphinx in-python instead of incurring a new shell/subprocess, but that's not the point here...
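To make the question about pty.fork concrete, here is a minimal standard-library sketch (it does not use Invoke, and the helper names fd_is_open and spawn_and_reap are made up for illustration). It suggests that the forked child going away does not release the parent's end of the pty: that descriptor stays open until the parent closes it explicitly.

```python
import os
import pty


def fd_is_open(fd):
    """Return True if fd is still an open descriptor in this process."""
    try:
        os.fstat(fd)
    except OSError:
        return False
    return True


def spawn_and_reap():
    """Fork a child on a new pty, wait for it to exit, return the master fd."""
    pid, master_fd = pty.fork()
    if pid == 0:
        # Child: exit immediately without running any parent cleanup code.
        os._exit(0)
    # Parent: reap the child so that no zombie process is left behind.
    os.waitpid(pid, 0)
    return master_fd


master_fd = spawn_and_reap()
# The child has exited and been reaped, but the parent still holds the fd.
still_open_after_exit = fd_is_open(master_fd)
os.close(master_fd)
open_after_close = fd_is_open(master_fd)
print(still_open_after_exit, open_after_close)
```

Whether Invoke's Local runner hits exactly this path is an assumption at this point in the thread; the sketch only shows the underlying OS behavior.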
@bitprophet
Member Author

bitprophet commented Aug 26, 2019

More reports of this from Linux platforms, which see many files created in /dev/pts (as in, thousands) and eventually receive os.error with message out of pty devices. I assume, but have not verified, that this is the same underlying error type as I reported in the OP, just different platform flavor.

That report concerns long-running Python processes using the anonymous top level invoke.run(), which creates and throws away an unbound Context object many times; whereas I think my watchdog invocations code above was using a task's Context, and therefore is a single Context for the lifetime of the process.

However, without having reread the OP until just now, I came to the same conclusion it does - this is likely an OS level issue where I simply am unaware of how to "close" an in-use pty so it is fully released before the process itself ends. So wondering about Python level garbage collection (the other obvious culprit in such situations) seems moot.

Especially given that it's Runner, not Context, that manages all this - and even the single-Context scenario is going to be creating many anonymous internal Runner objects over its lifetime.

@bitprophet bitprophet changed the title Possible issue with failure to close pty after run() in long-running programs Pseudo-terminal devices stay open/allocated after run() finishes Aug 26, 2019
@bitprophet bitprophet changed the title Pseudo-terminal devices stay open/allocated after run() finishes Pseudo-terminal (pty) devices stay open/allocated after run() finishes Aug 26, 2019
@bitprophet
Member Author

Seems trivially reproducible with the following handy dandy all in one script:

import os

from invoke import Local, Context


class PtyReporter(Local):
    def start(self, *args, **kwargs):
        super().start(*args, **kwargs)
        if self.pid != 0:
            print(self.parent_fd)  # this will be the FD of the allocated pty


for _ in range(10):
    PtyReporter(Context()).run('whoami', pty=True, hide=True)

Local(Context()).run("lsof -p {}".format(os.getpid()))

Output, on macOS 10.14:

» python dangling.py
3
4
5
6
7
8
9
10
11
12
COMMAND     PID     USER   FD   TYPE             DEVICE SIZE/OFF     NODE NAME
<snipped actual code, shlib etc which are static/irrelevant>
python3.7 80774 jforcier    0u   CHR               16,9  0t85140    11709 /dev/ttys009
python3.7 80774 jforcier    1u   CHR               16,9  0t85140    11709 /dev/ttys009
python3.7 80774 jforcier    2u   CHR               16,9  0t85140    11709 /dev/ttys009
python3.7 80774 jforcier    3u   CHR              15,10     0t10      580 /dev/ptmx
python3.7 80774 jforcier    4u   CHR              15,11     0t10      580 /dev/ptmx
python3.7 80774 jforcier    5u   CHR              15,12     0t10      580 /dev/ptmx
python3.7 80774 jforcier    6u   CHR              15,13     0t10      580 /dev/ptmx
python3.7 80774 jforcier    7u   CHR              15,14     0t10      580 /dev/ptmx
python3.7 80774 jforcier    8u   CHR              15,15     0t10      580 /dev/ptmx
python3.7 80774 jforcier    9u   CHR              15,18     0t10      580 /dev/ptmx
python3.7 80774 jforcier   10u   CHR              15,19     0t10      580 /dev/ptmx
python3.7 80774 jforcier   11u   CHR              15,20     0t10      580 /dev/ptmx
python3.7 80774 jforcier   12u   CHR              15,21     0t10      580 /dev/ptmx
python3.7 80774 jforcier   14   PIPE 0xe5c9ce895c214dab    16384          ->0xe5c9ce895c2156ab
python3.7 80774 jforcier   15   PIPE 0xe5c9ce895c2143eb    16384          ->0xe5c9ce895c215b2b
python3.7 80774 jforcier   17   PIPE 0xe5c9ce895c21522b    16384          ->0xe5c9ce895c2135ab

And sure enough, we see the FDs increment, and they're all still held open by the process before it exits. I'd expect identical behavior on Linux, just with different device paths and suchlike.
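The same check can be done in-process instead of shelling out to lsof. The sketch below is an illustration only: open_fds is a hypothetical helper, and it assumes the platform exposes /dev/fd (true on macOS and on Linux with /proc mounted). It uses os.openpty rather than Invoke so that it stays self-contained.

```python
import os


def open_fds():
    """Return the set of descriptor numbers currently open in this process."""
    fds = set()
    for name in os.listdir("/dev/fd"):
        try:
            fd = int(name)
            # Skip entries that are already gone, such as the descriptor
            # that listdir itself used while reading the directory.
            os.fstat(fd)
        except (ValueError, OSError):
            continue
        fds.add(fd)
    return fds


before = open_fds()
master_fd, slave_fd = os.openpty()
# Both ends of the new pty show up as extra descriptors.
leaked = open_fds() - before
os.close(master_fd)
os.close(slave_fd)
after = open_fds()
print(sorted(leaked), after == before)
```

Comparing snapshots like this before and after a batch of run(..., pty=True) calls would be one way to spot the leak in a long-running process without external tools.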

So now the question is whether this is as simple as doing os.close(fd) or whatever.

@bitprophet
Member Author

Yup. The following amendment to the above script, and its output:

# Added to the PtyReporter Local subclass
    def stop(self):
        os.close(self.parent_fd)

» python dangling.py
3
3
3
3
3
3
3
3
3
3
COMMAND     PID     USER   FD   TYPE             DEVICE SIZE/OFF     NODE NAME
python3.7 80862 jforcier    0u   CHR               16,9  0t88772    11709 /dev/ttys009
python3.7 80862 jforcier    1u   CHR               16,9  0t88772    11709 /dev/ttys009
python3.7 80862 jforcier    2u   CHR               16,9  0t88772    11709 /dev/ttys009
python3.7 80862 jforcier    4   PIPE 0xe5c9ce895c21522b    16384          ->0xe5c9ce895c214dab
python3.7 80862 jforcier    5   PIPE 0xe5c9ce895c215b2b    16384          ->0xe5c9ce895c2156ab
python3.7 80862 jforcier    7   PIPE 0xe5c9ce895c2143eb    16384          ->0xe5c9ce895c2132ab

No dangling! And ironically the existing implementation of stop in Local is a comment saying "nothing to do yet!". So that's probably the easiest bugfix ever? Famous last words...
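For reference, the general shape of that cleanup can be sketched with the standard library alone. This is not Invoke's code or its eventual fix; run_in_pty is a hypothetical helper that just shows the pattern of closing the parent's pty descriptor in a finally block so it is released even if reading fails.

```python
import os
import pty


def run_in_pty(argv):
    """Run argv on a fresh pty; return (output_bytes, master_fd_number)."""
    pid, master_fd = pty.fork()
    if pid == 0:
        # Child: replace this process with the command.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed.
            os._exit(127)
    chunks = []
    try:
        while True:
            try:
                data = os.read(master_fd, 1024)
            except OSError:
                # Linux raises EIO once the child side of the pty is closed.
                break
            if not data:
                # Other platforms may signal end of output with an empty read.
                break
            chunks.append(data)
    finally:
        # Release the pty and reap the child no matter how the loop ended.
        os.close(master_fd)
        os.waitpid(pid, 0)
    return b"".join(chunks), master_fd


# With the close in place, each run should reuse the same descriptor number.
fds = [run_in_pty(["echo", "hello"])[1] for _ in range(5)]
print(fds)
```

Because the operating system hands out the lowest free descriptor number, seeing the same number on every iteration matches the repeated 3s in the output above.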
