Bug #5309

Testing for 1.4 release

Added by Joshua Drake over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Alexander Shulgin
Start date:
10/28/2014
Due date:
% Done:

0%

Estimated time:
Resolution:

Description

Ash,

We need a peer review on the latest -head. Specific things we are looking to test:

1. We moved to hard links for the archiver queue
2. We moved to multi-threaded base backups
3. I know that failover does not work correctly in 1.3 (it creates the recovery.done file but postgresql doesn't fail over)
4. I know some of the -A commands don't seem to work and sometimes throw exceptions (I tested -Astop)
5. I know that weird combinations of the cmd_standby.ini file sometimes cause errors (such as use_confs)

#1

Updated by Alexander Shulgin over 9 years ago

writes:

We need a peer review on the latest -head. Specific things we are looking to test:

1. We moved to hard links for the archiver queue

Well, it doesn't look like we really did. I've checked the latest master
and zamatcmd/pitrtools on GitHub, but I don't see the relevant changes.

2. We moved to multi-threaded base backups

While I'm really late to the party, I fail to see how this was a good
idea.

Throwing in more threads/cores would be a win only if CPU were the
bottleneck, which is apparently not the case for base backups. Is there
any evidence that the implemented threaded rsync is actually
faster/better than plain rsync or pg_basebackup?

3. I know that failover does not work correctly in 1.3 (it creates the recovery.done file but postgresql doesn't fail over)
4. I know some of the -A commands don't seem to work and sometimes throw exceptions (I tested -Astop)
5. I know that weird combinations of the cmd_standby.ini file sometimes cause errors (such as use_confs)

Will check on these in the meantime.

--
Alex

#2

Updated by Joshua Drake over 9 years ago

On 10/29/2014 07:26 AM, wrote:

Issue #5309 has been updated by Alexander Shulgin.

writes:

We need a peer review on the latest -head. Specific things we are looking to test:

1. We moved to hard links for the archiver queue

Well, it doesn't look like we really did. I've checked the latest master
and zamatcmd/pitrtools on GitHub, but I don't see the relevant changes.

Well hell. So that is a task, move it to hard links.

2. We moved to multi-threaded base backups

While I'm really late to the party, I fail to see how this was a good
idea.

Throwing in more threads/cores would be a win only if CPU were the
bottleneck, which is apparently not the case for base backups. Is there
any evidence that the implemented threaded rsync is actually
faster/better than plain rsync or pg_basebackup?

Quite... try backing up a 500GB database and see what happens. In
production we have seen base backups go from half a day, or even
multiple days, down to hours or less.

rsync and pg_basebackup (especially pg_basebackup) are very sad in
comparison on anything > 50GB.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
"If we send our children to Caesar for their education, we should
not be surprised when they come back as Romans."

#3

Updated by Alexander Shulgin over 9 years ago

writes:

Well hell. So that is a task, move it to hard links.

OK, this needs to be a separate ticket, I'll make one.

But before I start, there are a number of changes in Zam's repo not
merged back to master. Is he with us? Should I merge them (which ones)?

2. We moved to multi-threaded base backups

While I'm really late to the party, I fail to see how this was a good
idea.

Throwing in more threads/cores would be a win only if CPU were the
bottleneck, which is apparently not the case for base backups. Is there
any evidence that the implemented threaded rsync is actually
faster/better than plain rsync or pg_basebackup?

Quite... try backing up a 500GB database and see what happens. In
production we have seen base backups go from half a day, or even
multiple days, down to hours or less.

rsync and pg_basebackup (especially pg_basebackup) are very sad in
comparison on anything > 50GB.

This is interesting, but I don't understand the theory behind it. If
the bottleneck is either bandwidth or disk IO, then how does adding
more workers that compete for the same (already IO-saturated) resources
really help?

--
Alex
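The parallelism question above can be made concrete. A hypothetical sketch of the threaded approach (not the actual pitrtools implementation): split the file list into size-balanced chunks and run one rsync per chunk, so several ssh/TCP streams are in flight at once.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def partition_by_size(files, workers):
    """Greedily spread (path, size) pairs across workers so each rsync
    invocation moves roughly the same number of bytes."""
    bins = [[] for _ in range(workers)]
    totals = [0] * workers
    for path, size in sorted(files, key=lambda f: -f[1]):
        i = totals.index(min(totals))  # emptiest bin so far
        bins[i].append(path)
        totals[i] += size
    return bins

def parallel_rsync(files, src_host, dest_dir, workers=4):
    """Run one rsync per chunk concurrently (assumes ssh access to src_host)."""
    def run(chunk):
        # --files-from=- reads this worker's explicit file list from stdin
        return subprocess.run(
            ["rsync", "-a", "--files-from=-", src_host + ":/", dest_dir],
            input="\n".join(chunk).encode()).returncode
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, partition_by_size(files, workers)))
```

Whether this wins depends on where the bottleneck sits: against a single saturated spindle it mostly adds seeks, but across many files, per-stream compression, and TCP round-trips, extra streams can keep otherwise idle resources busy.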

#4

Updated by Alexander Shulgin over 9 years ago

writes:

4. I know some of the -A commands don't seem to work and sometimes throw exceptions (I tested -Astop)

.../pitrtools-1.3/bin$ ./cmd_standby -C ../etc/cmd_standby.ini -A start
Traceback (most recent call last):
  File "./cmd_standby", line 650, in <module>
    self.copy_conf()
NameError: name 'self' is not defined

This is the released version 1.3. Doesn't look like it's used by
anyone but us (facepalm).
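The traceback shape is telling: `self` only exists inside a method, so a line like `self.copy_conf()` must have ended up at module level. A minimal reproduction of that shape (a guess at the bug, not the actual cmd_standby source):

```python
class Standby:
    def copy_conf(self):
        return "copying configs"

# Buggy shape: `self` used outside any method raises NameError,
# exactly as in the 1.3 traceback.
try:
    self.copy_conf()
except NameError as err:
    print(err)                 # name 'self' is not defined

# Fixed shape: call the method on an actual instance.
print(Standby().copy_conf())   # copying configs
```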

#5

Updated by Alexander Shulgin over 9 years ago

  • Status changed from New to In Progress

Alexander Shulgin wrote:

This is interesting, but I don't understand the theory behind it. If
the bottleneck is either bandwidth or disk IO, then how does adding
more workers that compete for the same (already IO-saturated) resources
really help?

And if the bottleneck is actually CPU, then it must be due to the use of compression. Maybe we should look instead into using rsync with pigz, the parallel gzip?

#6

Updated by Joshua Drake over 9 years ago

On 10/29/2014 09:35 AM, wrote:

Issue #5309 has been updated by Alexander Shulgin.

writes:

4. I know some of the -A commands don't seem to work and sometimes throw exceptions (I tested -Astop)

.../pitrtools-1.3/bin$ ./cmd_standby -C ../etc/cmd_standby.ini -A start
Traceback (most recent call last):
  File "./cmd_standby", line 650, in <module>
    self.copy_conf()
NameError: name 'self' is not defined

This is the released version 1.3. Doesn't look like it's used by
anyone but us (facepalm).

Yeah that is one of them.


#7

Updated by Joshua Drake over 9 years ago

On 10/29/2014 10:00 AM, wrote:

Issue #5309 has been updated by Alexander Shulgin.

Status changed from New to In Progress

Alexander Shulgin wrote:

This is interesting, but I don't understand the theory behind it. If
the bottleneck is either bandwidth or disk IO, then how does adding
more workers that compete for the same (already IO-saturated) resources
really help?

And if the bottleneck is actually CPU, then it must be due to the use of compression. Maybe we should look instead into using rsync with pigz, the parallel gzip?

I don't think you can because:

1. rsync doesn't compress (as far as I know), ssh does and I don't know
of any way to tell ssh to use a different compression method

2. Also keep in mind this isn't about IO saturation so much as
single-threaded bottlenecks as a whole. There is only so fast a single
thread can do a single thing, and if more channels are available we
should use them.

That said, I can't stress real-world practice enough here. We have
customers whose restores of multiple terabytes have gone from multiple
days down to hours using this method.

JD


#8

Updated by Alexander Shulgin over 9 years ago

writes:

And if the bottleneck is actually CPU, then it must be due to use of
compression. Maybe we should look instead into using rsync with
pigz, the parallel gzip?

I don't think you can because:

1. rsync doesn't compress (as far as I know), ssh does and I don't know
of any way to tell ssh to use a different compression method

Well, rsync doesn't compress by default (and neither does ssh).

2. Also keep in mind this isn't about IO saturation so much as
single-threaded bottlenecks as a whole. There is only so fast a single
thread can do a single thing, and if more channels are available we
should use them.

That said, I can't stress real-world practice enough here. We have
customers whose restores of multiple terabytes have gone from multiple
days down to hours using this method.

What I'm talking about is that it's weird to invent a new tool that
tries to outsmart rsync while a combination of existing tools can do
the job, e.g.:

ssh master "tar cf - /path/to/pgdata | pigz" | gunzip - | tar xf -

This is going to work great for initial base backups, while for updates
one might just use plain rsync (assuming relatively small diffs).

--
Alex

#9

Updated by Alexander Shulgin over 9 years ago

writes:

3. I know that failover does not work correctly in 1.3 (it creates the recovery.done file but postgresql doesn't fail over)

Well, this one worked for me. Do you recall any details or the most
recent instance of this problem?

--
Alex

#10

Updated by Joshua Drake over 9 years ago

On 10/30/2014 08:52 AM, wrote:

Issue #5309 has been updated by Alexander Shulgin.

writes:

3. I know that failover does not work correctly in 1.3 (it creates the recovery.done file but postgresql doesn't fail over)

Well, this one worked for me. Do you recall any details or the most
recent instance of this problem?

On db04 of kcls, I have tried to fail it over a number of times and it
doesn't do it. All I did was cmd_standby -C cmd_standby.ini -F999

JD

----------------------------------------
Bug #5309: Testing for 1.4 release
https://public.commandprompt.com/issues/5309#change-31525
----------------------------------------

#11

Updated by Alexander Shulgin over 9 years ago

writes:

3. I know that failover does not work correctly in 1.3 (it creates the recovery.done file but postgresql doesn't fail over)

Well, this one worked for me. Do you recall any details or the most
recent instance of this problem?

On db04 of kcls, I have tried to fail it over a number of times and it
doesn't do it. All I did was cmd_standby -C cmd_standby.ini -F999

On db04 it looks like a configuration issue.

--
Alex
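The configuration-issue diagnosis above matches how failover works in this generation of PostgreSQL: a rough sketch (pre-12 trigger-file semantics; the path shown is a hypothetical example, not the pitrtools default):

```shell
# The standby's recovery.conf names a trigger file, e.g.:
#   trigger_file = '/tmp/pitr_failover.trigger'
# Failover then only requires creating that file; PostgreSQL notices it,
# finishes recovery, renames recovery.conf to recovery.done, and starts
# accepting writes.
touch /tmp/pitr_failover.trigger

# If recovery.done shows up but the server stays read-only, the likely
# culprit is a mismatch between the path the failover command writes
# and the one recovery.conf actually watches.
```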

#12

Updated by Alexander Shulgin over 9 years ago

  • Status changed from In Progress to Resolved
