| Summary: | GIOPConnection is not safe against malicious or malfunctioned peers | | |
|---|---|---|---|
| Product: | JacORB | Reporter: | Marc Heide <marc.heide> |
| Component: | ORB | Assignee: | André Spiegel <spiegel> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | asi, crotwell, kent.au, Richard_Ridgway, rnc |
| Priority: | P2 | | |
| Version: | 2.3.0 beta 2 | | |
| Hardware: | All | | |
| OS: | All | | |

Attachments:
- patches getMessage() and sendMessage() to close connection in described situations
- patch for deadlock problem introduced by former patches (against 2.3 beta2)
- test client/server for deadlock/timeout
- patch against 2.3 beta to remove all calls to close() operation
- Solve potential deadlock in GIOPConnection.
- Solve potential deadlock in GIOPConnection. Last update 08 Jun 2009.
Description
Marc Heide
2006-07-20 09:34:14 UTC
Created attachment 243 [details]
patches getMessage() and sendMessage() to close connection in described situations
Applied both parts of the patch on CVS HEAD. Thanks.

There was an error introduced by the promoted and applied patch for GIOPConnection. This error may lead to deadlocks between reader and writer threads that try to close the socket concurrently, because two sync objects are relevant for closing the socket: connect_sync and write_sync. The implementation of streamClosed() in ClientGIOPConnection acquires them in the order connect_sync -> write_sync (reader thread), but while sending a message via sendMessage() they are acquired in the order write_sync -> connect_sync (the writer thread already holds the write lock when it calls streamClosed()).

Solution: call releaseWriteLock() before calling streamClosed() in the error situations of sendMessage(). The applied patch is against 2.3 beta2.

Created attachment 253 [details]
patch for deadlock problem introduced by former patches (against 2.3 beta2)
Second patch is installed. Thanks, Marc.

*** Bug 701 has been marked as a duplicate of this bug. ***
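The lock-ordering deadlock described in the report above boils down to two threads acquiring the same pair of monitors in opposite orders. A minimal sketch of the fix, assuming illustrative names (these are not the actual JacORB internals): both the reader path and the writer error path take the monitors in the same order, connectSync then writeSync, which makes a deadlock between them impossible.

```java
// Sketch of consistent lock ordering; names are illustrative, not JacORB's.
public class LockOrderDemo {
    private final Object connectSync = new Object();
    private final Object writeSync = new Object();
    private int closedCount = 0;

    // Reader thread path: connectSync -> writeSync.
    public void streamClosed() {
        synchronized (connectSync) {
            synchronized (writeSync) {
                closedCount++;
            }
        }
    }

    // Writer error path: mirrors "call releaseWriteLock() before
    // streamClosed()" -- it holds neither monitor when it enters the
    // connectSync -> writeSync path.
    public void sendMessageErrorPath() {
        streamClosed();
    }

    public synchronized int getClosedCount() {
        return closedCount;
    }

    public static void main(String[] args) throws InterruptedException {
        LockOrderDemo d = new LockOrderDemo();
        Thread reader = new Thread(d::streamClosed);
        Thread writer = new Thread(d::sendMessageErrorPath);
        reader.start();
        writer.start();
        reader.join();  // with opposite lock orders these joins could hang
        writer.join();
        System.out.println(d.getClosedCount()); // prints 2
    }
}
```

With the original ordering (writer holding write_sync while streamClosed() tries connect_sync -> write_sync), the two joins above could block forever; with a single consistent order both threads always complete.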
I believe that there is still a deadlock situation caused by this code. Here is
the test I did.
First, in StreamConnectionBase.java, around line 126, I changed the code as shown below, so that every so often it throws an IOException that looks like the one we were seeing in our clients. Unfortunately, the real socket exception has proven hard to reproduce, but I believe this is a good first test. I only made this change in the client's jacorb.jar.
Next I ran a simple "hello world" client and server (attached). The client orb
sees the IOException, marks the socket as bad, and creates a new socket to the
server successfully. However, the local call that triggered the IOException
doesn't return with the COMM_FAILURE and the client hangs until the client
timeout is reached. Subsequent calls to the same server also timeout.
The output of both the client and server is in client.out and server.out in the
tarball, as well as a thread dump on the client showing the deadlocked threads.
It is not clear to me how to resolve this, but at least there is a test case to
demonstrate it.
StreamConnectionBase.java, line 126:

```java
int USC_count = 0;

public void write (boolean is_first,
                   boolean is_last,
                   byte[] data,
                   int offset,
                   int length,
                   long time_out)
{
    try
    {
        // Every 10th call, fake the "Broken pipe" SocketException seen
        // from real clients, wrapped in an IOException the way the
        // transport layer would deliver it.
        if (USC_count++ % 10 == 9)
        {
            java.net.SocketException s =
                new java.net.SocketException("Broken pipe");
            java.io.IOException io = new java.io.IOException();
            io.initCause(s);
            System.out.println("Fake 'Broken pipe' " + io.getMessage());
            throw io;
        }
        // ... rest of the original write() implementation
```
Created attachment 256 [details]
test client/server for deadlock/timeout
Yes, you are right: at least the client locks up, but this is not caused by this code; it is caused by wrong behavior in Delegate.java. The Delegate simply ignores the COMM_FAILURE exception, and on the next call the Delegate is still in the state bound=true, which means it simply sends the request (including opening a new transport). BUT: the previous close() has forced the MessageReceptor (which reads answers from the socket) to quit, so there is no reader any more; that's why your ReplyReceiver hangs forever.

Possible solutions: the Delegate should be adapted to handle COMM_FAILURE exceptions like TRANSIENT exceptions (try_rebind); that would be the best solution. As a workaround you could throw away a narrowed object reference (to garbage collection) once you get a COMM_FAILURE exception in client code, but this cannot be the final solution.
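The suggested "handle COMM_FAILURE like TRANSIENT" behavior could look roughly like this at the invocation layer. This is a hedged sketch only: CommFailure stands in for org.omg.CORBA.COMM_FAILURE and the rebind hook stands in for Delegate.try_rebind(); neither is the real JacORB API.

```java
import java.util.function.Supplier;

// Hypothetical sketch: on COMM_FAILURE, rebind once and retry the call,
// the way TRANSIENT is handled via try_rebind().
public class RetrySketch {
    // Stand-in for org.omg.CORBA.COMM_FAILURE.
    static class CommFailure extends RuntimeException {}

    static <T> T invokeWithRebind(Supplier<T> call, Runnable rebind) {
        try {
            return call.get();
        } catch (CommFailure e) {
            rebind.run();       // re-establish transport / reference
            return call.get();  // retry exactly once
        }
    }

    public static void main(String[] args) {
        final boolean[] broken = {true};
        String result = invokeWithRebind(
            () -> {
                if (broken[0]) throw new CommFailure();
                return "reply";
            },
            () -> broken[0] = false); // "rebind" repairs the connection
        System.out.println(result);   // prints reply
    }
}
```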
I tried the idea of making the Delegate treat COMM_FAILUREs in the same way as TRANSIENTs, i.e. try_rebind(), but it did not seem to help.

In addition, I tried creating a new ref by reloading each CORBA object from the IOR via string_to_object, and this seemed to work occasionally but not always.
I expect there is a race condition with the garbage collector such that if the
gc cleans the old object in time, then there will be a new Delegate, but if not
then you are still dealing with the old one. Since you can never force an object
to be garbage collected, this doesn't seem like a good solution.
Here is the client side logging with debug messages, starting at the point when
the fake 'Broken pipe' is generated, till it hit the timeout.
[jacorb.orb.giop] INFO : ClientConnectionManager: found ClientGIOPConnection to
129.252.35.30:9876 (3da850)
Fake 'Broken pipe' null
[jacorb.orb.iiop] DEBUG : Caught exception
java.io.IOException
at
org.jacorb.orb.etf.StreamConnectionBase.write(StreamConnectionBase.java:139)
at org.jacorb.orb.giop.GIOPConnection.write(GIOPConnection.java:887)
at org.jacorb.orb.CDROutputStream.write(CDROutputStream.java:438)
at
org.jacorb.orb.giop.ServiceContextTransportingOutputStream.write_to(ServiceContextTransportingOutputStream.java:148)
at
org.jacorb.orb.giop.RequestOutputStream.write_to(RequestOutputStream.java:251)
at org.jacorb.orb.giop.GIOPConnection.sendMessage(GIOPConnection.java:971)
at org.jacorb.orb.giop.GIOPConnection.sendRequest(GIOPConnection.java:923)
at
org.jacorb.orb.giop.ClientConnection.sendRequest(ClientConnection.java:323)
at
org.jacorb.orb.giop.ClientConnection.sendRequest(ClientConnection.java:304)
at org.jacorb.orb.Delegate.invoke_internal(Delegate.java:1024)
at org.jacorb.orb.Delegate.invoke(Delegate.java:939)
at org.omg.CORBA.portable.ObjectImpl._invoke(ObjectImpl.java:457)
at demo.hello._GoodDayStub.hello_simple(_GoodDayStub.java:32)
at demo.hello.Client$1.run(Client.java:26)
at java.lang.Thread.run(Thread.java:613)
Caused by: java.net.SocketException: Broken pipe
at
org.jacorb.orb.etf.StreamConnectionBase.write(StreamConnectionBase.java:138)
... 14 more
[jacorb.giop.conn] ERROR : Failed to write GIOP message due to COMM_FAILURE
org.omg.CORBA.COMM_FAILURE: IOException: java.io.IOException vmcid: 0x0 minor
code: 0 completed: No
at
org.jacorb.orb.etf.ConnectionBase.to_COMM_FAILURE(ConnectionBase.java:143)
at
org.jacorb.orb.etf.StreamConnectionBase.write(StreamConnectionBase.java:153)
at org.jacorb.orb.giop.GIOPConnection.write(GIOPConnection.java:887)
at org.jacorb.orb.CDROutputStream.write(CDROutputStream.java:438)
at
org.jacorb.orb.giop.ServiceContextTransportingOutputStream.write_to(ServiceContextTransportingOutputStream.java:148)
at
org.jacorb.orb.giop.RequestOutputStream.write_to(RequestOutputStream.java:251)
at org.jacorb.orb.giop.GIOPConnection.sendMessage(GIOPConnection.java:971)
at org.jacorb.orb.giop.GIOPConnection.sendRequest(GIOPConnection.java:923)
at
org.jacorb.orb.giop.ClientConnection.sendRequest(ClientConnection.java:323)
at
org.jacorb.orb.giop.ClientConnection.sendRequest(ClientConnection.java:304)
at org.jacorb.orb.Delegate.invoke_internal(Delegate.java:1024)
at org.jacorb.orb.Delegate.invoke(Delegate.java:939)
at org.omg.CORBA.portable.ObjectImpl._invoke(ObjectImpl.java:457)
at demo.hello._GoodDayStub.hello_simple(_GoodDayStub.java:32)
at demo.hello.Client$1.run(Client.java:26)
at java.lang.Thread.run(Thread.java:613)
[jacorb.giop.conn] ERROR : GIOP connection closed due to errors during sendMessage
[jacorb.giop.conn] DEBUG : ClientGIOPConnection to 129.252.35.30:9876 (3da850):
close()
[jacorb.giop.conn] WARN : Abnormal connection termination. Lost 1 outstanding
replie(s)!
[jacorb.orb.iiop] INFO : Client-side TCP transport to 129.252.35.30:9876 closed.
[jacorb.giop.conn] DEBUG : ClientGIOPConnection to 129.252.35.30:9876 (3da850):
streamClosed()
[jacorb.giop.conn] DEBUG : ClientGIOPConnection to 129.252.35.30:9876 (3da850):
closeAllowReopen()
[jacorb.orb.delegate] DEBUG : invoke: SystemException
[jacorb.orb.delegate] DEBUG : Delegate.try_rebind
Caught exception in 18th loopIOException: java.io.IOException
[jacorb.orb.iiop] DEBUG : Transport to 129.252.35.30:9876: stream closed Socket
closed
[jacorb.orb.iiop] DEBUG : Caught exception
java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at
org.jacorb.orb.etf.StreamConnectionBase.read(StreamConnectionBase.java:81)
at org.jacorb.orb.giop.GIOPConnection.getMessage(GIOPConnection.java:393)
at
org.jacorb.orb.giop.GIOPConnection.receiveMessages(GIOPConnection.java:517)
at org.jacorb.orb.giop.MessageReceptor.doWork(MessageReceptor.java:71)
at org.jacorb.util.threadpool.ConsumerTie.run(ConsumerTie.java:61)
at java.lang.Thread.run(Thread.java:613)
[jacorb.giop.conn] DEBUG : ClientGIOPConnection to 129.252.35.30:9876 (3da850):
getMessage() -- COMM_FAILURE
[jacorb.giop.conn] DEBUG : ClientGIOPConnection to 129.252.35.30:9876 (3da850):
streamClosed()
[jacorb.giop.conn] DEBUG : ClientGIOPConnection to 129.252.35.30:9876 (3da850):
closeAllowReopen()
[jacorb.util.tpool] DEBUG : [1/1] job queue empty
[jacorb.orb.delegate] DEBUG : Delegate.getReference with POA < empty>
[jacorb.orb.delegate] DEBUG : Delegate.getReference with POA < empty>
[jacorb.orb.giop] INFO : ClientConnectionManager: found ClientGIOPConnection to
129.252.35.30:9876 (3da850)
[jacorb.giop.conn] DEBUG : ClientGIOPConnection to 129.252.35.30:9876 (3da850):
sendMessage() -- opening transport
[jacorb.orb.iiop] DEBUG : Trying to connect to 129.252.35.30:9876 with
timeout=90000.
[jacorb.orb.giop] DEBUG : ClientConnectionManager: cannot release
ClientGIOPConnection to 129.252.35.30:9876 (3da850) (still has 4 client(s))
[jacorb.orb.delegate] DEBUG : Delegate released!
[jacorb.orb.giop] DEBUG : ClientConnectionManager: cannot release
ClientGIOPConnection to 129.252.35.30:9876 (3da850) (still has 3 client(s))
[jacorb.orb.delegate] DEBUG : Delegate released!
[jacorb.orb.giop] DEBUG : ClientConnectionManager: cannot release
ClientGIOPConnection to 129.252.35.30:9876 (3da850) (still has 2 client(s))
[jacorb.orb.delegate] DEBUG : Delegate released!
[jacorb.orb.iiop] INFO : Connected to 129.252.35.30:9876 from local port 54778
Caught exception in 19th loopclient timeout reached
[jacorb.orb.delegate] DEBUG : Delegate.getReference with POA < empty>
[jacorb.orb.delegate] DEBUG : Delegate.getReference with POA < empty>
[jacorb.orb.giop] INFO : ClientConnectionManager: found ClientGIOPConnection to
129.252.35.30:9876 (3da850)
Well, after some testing with your client and mine, I think I found the root cause of this behavior. The GIOPConnection and the ClientConnection on top of it are designed as, let's say, "outliving" objects, so calling close() on GIOPConnection should not be done from lower layers or from GIOPConnection itself. Instead, to indicate a problem with the underlying transport, it is quite sufficient to just call the streamClosed() operation and leave the real close() to the outer objects (Delegates or GIOPConnectionManagers). Calling close() before all Delegates have finished their work with this connection leads to error situations where a new transport connection is opened but no MessageReceptor is created to read the answers. So I simply removed all calls to close() introduced by my patches from GIOPConnection, and now it seems to work fine. See my attachment.

Created attachment 257 [details]
patch against 2.3 beta to remove all calls to close() operation
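The layering rule described above (lower layers only signal upward; only the owning objects ever call close()) can be sketched as follows. All names here are illustrative stand-ins, not the actual JacORB classes.

```java
// Sketch of "signal upward, never close from below".
public class LayeringSketch {

    interface StreamListener {
        void streamClosed();
    }

    // Stands in for the transport / lower layer.
    static class Transport {
        private final StreamListener owner;
        Transport(StreamListener owner) { this.owner = owner; }

        void onIOError() {
            // Do NOT call close() from down here -- just notify upward.
            owner.streamClosed();
        }
    }

    // Stands in for the "outliving" connection object.
    static class Connection implements StreamListener {
        boolean closed = false;
        boolean needsReopen = false;

        @Override
        public void streamClosed() {
            // Remember that the transport must be reopened, but keep the
            // connection (and its message-receptor setup) alive for the
            // Delegates that still reference it.
            needsReopen = true;
        }

        // Only outer objects (Delegates, connection managers) call this.
        void close() { closed = true; }
    }

    public static void main(String[] args) {
        Connection c = new Connection();
        new Transport(c).onIOError();
        System.out.println(c.needsReopen + " " + c.closed); // prints true false
    }
}
```

This keeps the connection object alive through transport failures, so a reopened transport always has a reader attached, which is exactly the invariant the earlier close() calls were breaking.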
Are these patches included in the 2.3 release? I couldn't find it in the release notes... maybe someone could tell me... regards, Christian

In addition to my previous entry: I compared 2.3 beta (unpatched) to the patched 2.3 beta and to 2.3 final, and found a mix of both old and patched code... so I'm a little bit confused about this. Are there any comments on the patch from 12/11/06 and/or on 2.3? Regards, Chris

The latest patch from 12/11/06 is not part of 2.3.0, but with this patch it now works very stably here. Regards, Marc

I have installed the final patch from 2006-12-11 into the main code base now. Sorry for the delay in getting this sorted out. It will be part of the next release.

Dear all,
A deadlock is found in my environment: JacORB 2.3 with the patch applied (all calls to close() removed from GIOPConnection). Here is the log message.
Name: RMI TCP Connection(103)-10.0.10.101
Status: WAITING at java.lang.Object@20d65
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:474)
org.jacorb.orb.giop.GIOPConnection.getWriteLock(GIOPConnection.java:812)
org.jacorb.orb.giop.ClientGIOPConnection.closeAllowReopen(ClientGIOPConnection.java:130)
org.jacorb.orb.giop.ClientGIOPConnection.streamClosed(ClientGIOPConnection.java:106)
org.jacorb.orb.giop.GIOPConnection.sendMessage(GIOPConnection.java:989)
org.jacorb.orb.giop.GIOPConnection.sendRequest(GIOPConnection.java:900)
org.jacorb.orb.giop.ClientConnection.sendRequest(ClientConnection.java:323)
org.jacorb.orb.giop.ClientConnection.sendRequest(ClientConnection.java:304)
org.jacorb.orb.Delegate.invoke_internal(Delegate.java:1024)
org.jacorb.orb.Delegate.invoke(Delegate.java:939)
org.omg.CORBA.portable.ObjectImpl._invoke(ObjectImpl.java:457)
CustomerModule._CustomerStub.getProvisioning(_CustomerStub.java:485)
...
Name: RMI TCP Connection(91)-10.0.10.101
Status: BLOCKED at java.lang.Object@5e320f ,Owner: RMI TCP Connection(103)-10.0.10.101
Stack:
org.jacorb.orb.giop.GIOPConnection.sendMessage(GIOPConnection.java:930)
org.jacorb.orb.giop.GIOPConnection.sendRequest(GIOPConnection.java:900)
org.jacorb.orb.giop.ClientConnection.sendRequest(ClientConnection.java:323)
org.jacorb.orb.giop.ClientConnection.sendRequest(ClientConnection.java:304)
org.jacorb.orb.Delegate.invoke_internal(Delegate.java:1024)
org.jacorb.orb.Delegate.invoke(Delegate.java:939)
org.omg.CORBA.portable.ObjectImpl._invoke(ObjectImpl.java:457)
CustomerModule._CustomerStub.getProvisioning(_CustomerStub.java:485)
...
My workaround solution is:
1) keep all calls to close() removed, as in the earlier patch
2) create a new method streamClosedWithoutWriteLock() (just remove the call to getWriteLock())
It works.
Kent
Hi all, please let me know if this issue is similar to http://www.jacorb.org/cgi-bin/bugzilla/show_bug.cgi?id=858 and whether the same solution can be applied here.

Kent, can you explain where to implement the method streamClosedWithoutWriteLock()?

Created attachment 363 [details]
Solve potential deadlock in GIOPConnection.
My test environment is a client on Windows XP along with TcpView, plus a server on Linux. Test steps: 1) start up client, server and TcpView; 2) trigger a request from the client, which will create a connection; 3) close the connection with TcpView. Expected result: a new connection is created and the system still functions.
Dear Abhishek Singh, I'm not 100% sure it is the same as #858. I've got COMM_FAILURE (broken pipe) in my software only under high stress, but I don't know why. My approach is to use TcpView to close the connection to mock the failure, and then try to fix it. I tested my patch yesterday and made some minor tuning. See attachment. Hope it can work. Kent

Created attachment 364 [details]
Solve potential deadlock in GIOPConnection. Last update 08 Jun 2009.
The bad news is that the previous patch did not survive a weekend's stability test. There is still a deadlock between write_lock and connect_sync. Thus I made another patch. This time, my strategy is to acquire write_lock before synchronizing on connect_sync whenever we need both of them. See the attachment for details.
FYI, the patch on 8 Jun works perfectly so far. Cheers!

Reopen to make sure we get Kent's final patch into the main code base.

Final part of patch is now committed (only the synchronization part in ClientGIOPConnection, as the other changes had already been applied).

(In reply to comment #24)
> Final part of patch is now committed (only the synchronization part in
> ClientGIOPConnection, as the other changes had already been applied).

André, I think there is still a problem here. I've been working with a customer who has a heavily threaded client, which happens to do quite a lot of Name Service resolving. They ran into a race condition that your fix ought to remedy, but I have thought of another possible problem.

They have many, perhaps a dozen, threads all waiting to send a request to the same name service. This means a whole bunch of threads in GIOPConnection.getWriteLock(), called from GIOPConnection.sendMessage(). Something happens and causes a COMM_FAILURE to be thrown to the thread that got the write lock and tried to send. This thread then calls releaseWriteLock() on its way to calling ClientGIOPConnection.closeAllowReopen(). A new thread returns from getWriteLock() and proceeds to try to send its message. However, the transport is corrupt and unclosed, since the first thread hasn't had a chance to close the transport; it is blocked in getWriteLock(). So a second COMM_FAILURE is thrown, and another thread enters closeAllowReopen() and blocks. Since there are many threads waiting for the write lock, this scenario can occur many times before a thread in closeAllowReopen() actually gets the write lock and proceeds to close the transport.

At this point you've got at least one thread waiting in closeAllowReopen(), and it is entirely possible that you also have threads waiting in sendMessage(). Let's say one of those gets a chance to go: it successfully opens a new transport, sends its request, and releases the write lock. Now one of the threads waiting to close gets the write lock and closes the transport, which by this time is the new one just opened. Oops! Now you have a client thread that successfully sent a message waiting forever (or until timeout) for a reply it will never receive, because another thread closed the transport out from under it.

I've finally had a chance to test my theory about an additional deadlock scenario. I now believe this is not a problem, as sendMessage() calls ClientGIOPConnection.streamClosed(), which calls ConnectionListener.streamClosed() after closeAllowReopen(). This ensures that even if a race as I hypothesized occurs, the result will be a COMM_FAILURE exception rather than a deadlock. I have observed that it is possible to have multiple threads waiting in closeAllowReopen(), and even for one thread to close the stream, another to reopen it, and a third, previously blocked in closeAllowReopen(), to then close the stream again. It might be possible to do something a little more efficient here to ensure a "reclose" doesn't happen, but I think the conditions leading to this are fairly rare, so I won't propose any change now.

*** Bug 759 has been marked as a duplicate of this bug. ***

*** Bug 858 has been marked as a duplicate of this bug. ***
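One way the "reclose" scenario above could be guarded against is a generation counter: a thread that decides to close takes a snapshot of the connection's generation first, and only closes if nobody reopened the transport in the meantime. This is a hypothetical sketch, not code from JacORB; all names are illustrative.

```java
// Hypothetical generation-counter guard against closing a freshly
// reopened transport.
public class ReopenableConnection {
    private final Object lock = new Object();
    private int generation = 0;
    private boolean open = true;

    public int snapshotGeneration() {
        synchronized (lock) { return generation; }
    }

    // Called by a sender that opens a fresh transport.
    public void reopen() {
        synchronized (lock) {
            open = true;
            generation++; // invalidates older close decisions
        }
    }

    // Returns true only if we actually closed; a stale snapshot means
    // another thread reopened the transport, so we must not close it again.
    public boolean closeIfSameGeneration(int observedGeneration) {
        synchronized (lock) {
            if (generation != observedGeneration) {
                return false;
            }
            open = false;
            return true;
        }
    }

    public boolean isOpen() {
        synchronized (lock) { return open; }
    }

    public static void main(String[] args) {
        ReopenableConnection c = new ReopenableConnection();
        int g = c.snapshotGeneration(); // thread A decides to close
        c.reopen();                     // thread B reopens in between
        // A's stale close decision is rejected; the new transport survives.
        System.out.println(c.closeIfSameGeneration(g) + " " + c.isOpen());
    }
}
```

As the comment above notes, the race is rare enough that the extra bookkeeping may not be worth it; this merely illustrates what a "more efficient" guard could look like.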