Sunday, September 26, 2010

Oracle Streams Apply Process changes in 11gR2

I’ve set up simple replication of table t1 from schema src to schema dst, changed the Apply Server parallelism to 1, and ran a simple test: insert 100 rows on the source while SQL trace is enabled on the Apply Server session.
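
Setting the parallelism looks roughly like this (a sketch; the apply process name APPLY_T1 is hypothetical):

begin
  dbms_apply_adm.set_parameter(
    apply_name => 'APPLY_T1',   -- hypothetical apply process name
    parameter  => 'parallelism',
    value      => '1');
end;
/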



SQL> desc src.t1;

Name Type          Nullable Default Comments
---- ------------- -------- ------- --------
N    NUMBER
V    VARCHAR2(100) Y
SQL> select count(*) from src.t1;

COUNT(*)
----------
0

SQL> select count(*) from dst.t1;

COUNT(*)
----------
0

SQL> select sid from v$streams_apply_server;

SID
----------
22

SQL> exec dbms_monitor.session_trace_enable(22, waits => true);

PL/SQL procedure successfully completed

SQL> insert into src.t1
2 select level, to_char(level)
3 from dual
4 connect by level <= 100;

100 rows inserted

SQL> commit;

Commit complete

SQL> select count(*) from dst.t1;

COUNT(*)
----------
100

SQL> exec dbms_monitor.session_trace_disable(22);

PL/SQL procedure successfully completed
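
On a side note, if you need to locate the Apply Server’s trace file, v$process exposes its name in 11g (a sketch, reusing SID 22 from above):

SQL> select p.tracefile
  2    from v$process p, v$session s
  3   where p.addr = s.paddr
  4     and s.sid = 22;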


******************

To my surprise, when I went to take a look at the trace file, I couldn’t find anything related to the Apply process inserting rows into dst.t1, just a handful of internal housekeeping statements.

This made me curious about where it all had really gone, so I decided to take a look at v$sql to see if there were any clues:

******************

SQL> select sql_text
from v$sql
where lower(sql_text) like '%insert%dst%t1%';

SQL_TEXT
--------------------------------------------------------------------------------
select sql_text from v$sql where lower(sql_text) like '%insert%dst%t1%'




************


The only thing I was able to see there was . . . my own statement. Things were starting to look serious.

In order to finally figure out who did what, I launched LogMiner:



SQL> begin
dbms_logmnr.start_logmnr(
startTime => sysdate-30/1440,
endTime => sysdate,
Options => dbms_logmnr.DICT_FROM_ONLINE_CATALOG+dbms_logmnr.CONTINUOUS_MINE
);
end;
/

PL/SQL procedure successfully completed


*********************


Let’s see who actually inserted all these rows:



SQL> select * from (
select session#, sql_redo
from v$logmnr_contents
where operation='INSERT'
and seg_owner='DST'
and table_name='T1'
order by timestamp desc
) where rownum <= 5;

SESSION# SQL_REDO
---------- --------------------------------------------------------------------------------
22 insert into "DST"."T1"("N","V") values ('1','1');
22 insert into "DST"."T1"("N","V") values ('5','5');
22 insert into "DST"."T1"("N","V") values ('4','4');
22 insert into "DST"."T1"("N","V") values ('3','3');
22 insert into "DST"."T1"("N","V") values ('2','2');



********

The session with SID 22 is none other than our Apply Server . . .

The next step was to figure out whether we were really dealing with some new codepath whose “spectacular” performance was, apparently, due to the complete lack of instrumentation :) or with just another weird Oracle bug.

I blocked the Apply Server on updating a row (one way of doing that is sketched below) and looked at the Apply Server’s stack.
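
A minimal sketch of the blocking setup, assuming a row with n = 1 already exists in both tables: hold an uncommitted row lock on dst.t1 from a helper session, then update the same row on the source.

-- helper session: take a row lock on the destination table, do not commit
SQL> update dst.t1 set v = 'blocker' where n = 1;

1 row updated

-- source session: change the same row and commit; the Apply Server
-- will now block on the TX enqueue while trying to apply this LCR
SQL> update src.t1 set v = 'blocked' where n = 1;

1 row updated

SQL> commit;

Commit complete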


************


[oracle@ora11gr2 trace]$ pstack 17036
#0 0x0000003b83cd450a in semtimedop () from /lib64/libc.so.6
#1 0x00000000085ef3f3 in sskgpwwait ()
#2 0x00000000085ee5c6 in skgpwwait ()
#3 0x000000000829ee31 in ksliwat ()
#4 0x000000000829e422 in kslwaitctx ()
#5 0x0000000000af92f5 in ksqcmi ()
#6 0x00000000082ac019 in ksqgtlctx ()
#7 0x00000000082aa77a in ksqgelctx ()
#8 0x0000000000c4d566 in ktcwit1 ()
#9 0x00000000082d5d99 in kdddgb ()
#10 0x00000000082c7530 in kdusru ()
#11 0x00000000082c0902 in kauupd ()
#12 0x0000000001f57c14 in kddiruUpdate ()
#13 0x000000000179eeab in knasdaExecDML ()
#14 0x000000000179d928 in knasdaProcDML ()
#15 0x000000000178c6fd in knaspd ()
#16 0x0000000001787d2f in knasplcr ()
#17 0x00000000017866d7 in knaspx ()
#18 0x0000000001770fd5 in knalsProc1Txn ()
#19 0x000000000177022d in knalsptxn ()
#20 0x00000000017424a6 in knasm2 ()
#21 0x0000000001776d8d in knalsma ()
#22 0x0000000000c25a7d in knlkcbkma ()
#23 0x0000000000b93ba7 in ksvrdp ()
#24 0x00000000020d2dd7 in opirip ()
#25 0x00000000016fe729 in opidrv ()
#26 0x0000000001b7183f in sou2o ()
#27 0x00000000009d3f8a in opimai_real ()
#28 0x0000000001b76ace in ssthrdmain ()
#29 0x00000000009d3e71 in main ()

Before we move on, here is a stack dump from a blocked Apply Server in 10.2.0.4:
[oracle@ora10gr2 trace]$ pstack 23787
#0 0x0000003b83cd450a in semtimedop () from /lib64/libc.so.6
#1 0x00000000085ef3f3 in sskgpwwait ()
#2 0x00000000085ee5c6 in skgpwwait ()
#3 0x000000000829ee31 in ksliwat ()
#4 0x000000000829e422 in kslwaitctx ()
#5 0x0000000000af92f5 in ksqcmi ()
#6 0x00000000082ac019 in ksqgtlctx ()
#7 0x00000000082aa77a in ksqgelctx ()
#8 0x0000000000c4d566 in ktcwit1 ()
#9 0x00000000082d5d99 in kdddgb ()
#10 0x00000000082c7530 in kdusru ()
#11 0x00000000082c0902 in kauupd ()
#12 0x00000000084588c9 in updrow ()
#13 0x00000000084f2580 in qerupFetch ()
#14 0x0000000008453cdd in updaul ()
#15 0x0000000008451bca in updThreePhaseExe ()
#16 0x00000000084509f5 in updexe ()
#17 0x00000000083fe18f in opiexe ()
#18 0x00000000083f5c0d in opiall0 ()
#19 0x0000000008403d25 in opikpr ()
#20 0x00000000083f78b9 in opiodr ()
#21 0x00000000084892af in __PGOSF141_rpidrus ()
#22 0x00000000085ee820 in skgmstack ()
#23 0x000000000848a759 in rpiswu2 ()
#24 0x000000000848fdf4 in kprball ()
#25 0x0000000001c7c4d7 in knipxup ()
#26 0x0000000001c72651 in knipdis ()
#27 0x000000000178cacc in knaspd ()
#28 0x0000000001787d2f in knasplcr ()
#29 0x00000000017866d7 in knaspx ()
#30 0x0000000001770fd5 in knalsProc1Txn ()
#31 0x000000000177022d in knalsptxn ()
#32 0x00000000017424a6 in knasm2 ()
#33 0x0000000001776d8d in knalsma ()
#34 0x0000000000c25a7d in knlkcbkma ()
#35 0x0000000000b93ba7 in ksvrdp ()
#36 0x00000000020d2dd7 in opirip ()
#37 0x00000000016fe729 in opidrv ()
#38 0x0000000001b7183f in sou2o ()
#39 0x00000000009d3f8a in opimai_real ()
#40 0x0000000001b76ace in ssthrdmain ()
#41 0x00000000009d3e71 in main ()


The stack is only 30 functions deep in 11.2.0.1, compared to 42 in 10.2.0.4! And given that everything above ktcwit1() in both dumps is there only because the sessions are waiting on the enqueue, the relative codepath change is even bigger.



All the difference comes down to one key thing: a recursive call. If you take a look at frame #23 in the 10.2.0.4 stack, you’ll notice the rpiswu2() function (for those of you unfamiliar with Oracle kernel layers, RPI stands for Recursive Program Interface).


Whatever happens further up the stack is essentially the same codepath any user session would use while executing an UPDATE statement. The Apply Servers in 10.2.0.4 generally behave like any other user session, and whatever diagnostic techniques you have learned while troubleshooting user issues can, to a large extent, be applied to the Apply Servers as well.

Every LCR execution leads to at least one recursive call (so if you have, say, a transaction with 1,000 LCRs, that means at least 1,000 recursive calls by the Apply Server). In 11.2.0.1 the recursive call is gone and the codepath is different all the way up to the kauupd() (KA, Access layer) function.

Indeed, looking at the Apply Server statistics in 11.2.0.1, you will notice that executing an LCR no longer results in a recursive call. The entire change seems to be a shortcut which allows the Apply Server to proceed directly into the KD (Data) layer, bypassing the “regular” codepath and avoiding the recursive call.
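
You can watch this yourself by sampling the “recursive calls” statistic for the Apply Server session before and after applying a transaction (a sketch, again using SID 22 from above):

SQL> select n.name, s.value
  2    from v$sesstat s, v$statname n
  3   where s.statistic# = n.statistic#
  4     and n.name = 'recursive calls'
  5     and s.sid = 22;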

On a side note, it appears that this new codepath was first introduced in 11.1.0.7.




**************************************


What’s up with the instrumentation?

While the performance improvement is certainly welcome, there is a big downside: all these new functions seem to be poorly instrumented, or not instrumented at all. This makes it hard to evaluate the gains, as some of the information is simply not there.



How are you supposed to figure out what’s going on, then? The good news is that the regular dynamic performance views (v$session_wait, v$session_event, etc.) seem to be populated correctly, but SQL trace took a big hit (plus you can no longer see the sql_id in v$session). Whatever falls outside the “old” stuff is pretty much a black box.
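
For example, a query along these lines still gives you a truthful picture of what the Apply Server has been waiting on (a sketch using SID 22 from above):

SQL> select event, total_waits, time_waited
  2    from v$session_event
  3   where sid = 22
  4   order by time_waited desc;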


Puzzled by this problem, I tried to see whether there is an easy way to re-enable the old codepath and get all the instrumentation facilities back in place. After some trial and error, it turned out that a simple row-level trigger . . .
SQL> create or replace trigger dst.buid_t1 before delete or insert or update on dst.t1
  2  for each row
  3  begin
  4  null;
  5  end;
  6  /

Trigger created

SQL> begin
  2  dbms_ddl.set_trigger_firing_property('DST', 'BUID_T1', false);
  3  end;
  4  /

PL/SQL procedure successfully completed

. . . is enough to get the old codepath back. Here is the stack of the Apply Server process in 11.2.0.1 with such a trigger in place:
[oracle@ora11gr2 trace]$ pstack 30640
#0 0x0000003b83cd450a in semtimedop () from /lib64/libc.so.6
#1 0x00000000085ef3f3 in sskgpwwait ()
#2 0x00000000085ee5c6 in skgpwwait ()
#3 0x000000000829ee31 in ksliwat ()
#4 0x000000000829e422 in kslwaitctx ()
#5 0x0000000000af92f5 in ksqcmi ()
#6 0x00000000082ac019 in ksqgtlctx ()
#7 0x00000000082aa77a in ksqgelctx ()
#8 0x0000000000c4d566 in ktcwit1 ()
#9 0x00000000082d5d99 in kdddgb ()
#10 0x00000000082c7530 in kdusru ()
#11 0x00000000082c0902 in kauupd ()
#12 0x00000000084588c9 in updrow ()
#13 0x00000000084f2580 in qerupFetch ()
#14 0x00000000046a363f in qerstFetch ()
#15 0x0000000008453cdd in updaul ()
#16 0x0000000008451bca in updThreePhaseExe ()
#17 0x00000000084509f5 in updexe ()
#18 0x00000000083fe18f in opiexe ()
#19 0x00000000083f5c0d in opiall0 ()
#20 0x0000000008403d25 in opikpr ()
#21 0x00000000083f78b9 in opiodr ()
#22 0x00000000084892af in __PGOSF141_rpidrus ()
#23 0x00000000085ee820 in skgmstack ()
#24 0x000000000848a759 in rpiswu2 ()
#25 0x000000000848fdf4 in kprball ()
#26 0x0000000001c7c4d7 in knipxup ()
#27 0x0000000001c72651 in knipdis ()
#28 0x000000000178cacc in knaspd ()
#29 0x0000000001787d2f in knasplcr ()
#30 0x00000000017866d7 in knaspx ()
#31 0x0000000001770fd5 in knalsProc1Txn ()
#32 0x000000000177022d in knalsptxn ()
#33 0x00000000017424a6 in knasm2 ()
#34 0x0000000001776d8d in knalsma ()
#35 0x0000000000c25a7d in knlkcbkma ()
#36 0x0000000000b93ba7 in ksvrdp ()
#37 0x00000000020d2dd7 in opirip ()
#38 0x00000000016fe729 in opidrv ()
#39 0x0000000001b7183f in sou2o ()
#40 0x00000000009d3f8a in opimai_real ()
#41 0x0000000001b76ace in ssthrdmain ()
#42 0x00000000009d3e71 in main ()



Now that looks much more familiar! All the instrumentation appeared to be back in place as well.

I’ve also discovered that:

1. DELETE seems to always be handled through the old codepath.
2. If you have a unique constraint or a primary key supported by a non-unique index, INSERT falls back to the old codepath (one way to end up with such a setup is sketched after this list).
3. UPDATE needs a primary key or key column(s) supported by an index in order to use the new codepath.
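
For reference, here is a sketch of the setup from point 2, a primary key deliberately supported by a non-unique index (the index and constraint names are my own):

SQL> create index dst.t1_i on dst.t1 (n);

Index created

SQL> alter table dst.t1 add constraint t1_pk primary key (n) using index dst.t1_i;

Table altered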


It remains to be seen whether this new codepath was implemented as a shortcut for the most frequently used scenarios only, or whether the current restrictions are implementation limits that will be lifted in future releases . . . or maybe never, with GoldenGate taking over as the strategic direction.
