如何在Amazon EC2上运行python守护程序的多个并行实例连接到RDS PostgreSQL数据库和S3存储桶?

问题描述

我有一个EC2实例,在其中运行实现以下步骤的python守护程序:

  1. 发送 API请求,以启动数据的点击流;
  2. 对于收到的每条记录,以JSON格式,对数据应用一些转换,并保存在字符串中,并在收到大量记录后,将其转储为CSV strong>本地格式;
  3. 运行连接到RDS Postgresql db的子进程,并执行copY命令以从先前本地转储的CSV中加载数据;
  4. 将当前CSV加载到S3存储桶,然后从EC2实例中删除

所有步骤都运行良好;问题是,守护程序的单个实例不足以保持 near 的实时速度,因为几个小时后,从API获得记录的时间到从请求并转储到数据库中太大。

鉴于此,当运行守护程序的多个实例(无论是2还是8)时,实际上只有一个正在运行,cpu%约为11%。 我想要实现的是让所有实例都以相似的cpu使用率启动并运行。

除了启动守护程序的不同实例外,我还尝试过:

  • 使用python多重处理来启动不同的进程;
  • 使用 taskset 命令为每个进程分配一个特定的cpu
  • 使用 parallel 命令强制并行执行。

有关体系结构的详细信息:

  • EC2实例c5.2xlarge(8个vcpu,8 GB内存,仅EBS实例存储,网络带宽高达10 Gbps,EBS带宽高达4750 Mbps);
  • x86_64-pc-linux-gnu上的Postgresql 11.6,由gcc(GCC)4.8.5 20150623(Red Hat 4.8.5-11)编译,64位。最大连接值:5000。

实现了守护程序类:

import sys,os,time,atexit,signal

class Daemon:
    """A generic daemon class.

    Usage: subclass the daemon class and override the run() method."""

    def __init__(self,pidfile,stdin,stdout,stderr): 
        self.pidfile = pidfile
        self.stdin = stdin
        self.stdout = stdout
        self.stderr = stderr
    
    def daemonize(self):
        """Deamonize class. UNIX double fork mechanism."""

        try: 
            pid = os.fork() 
            if pid > 0:
                # exit first parent
                sys.exit(0) 
        except OSError as err: 
            sys.stderr.write('fork #1 Failed: {0}\n'.format(err))
            sys.exit(1)
    
        # decouple from parent environment
        os.chdir('/') 
        os.setsid() 
        os.umask(0) 
    
        # do second fork
        try: 
            pid = os.fork() 
            if pid > 0:

                # exit from second parent
                sys.exit(0) 
        except OSError as err: 
            sys.stderr.write('fork #2 Failed: {0}\n'.format(err))
            sys.exit(1) 
    
        # redirect standard file descriptors
        sys.stdout.flush()
        sys.stderr.flush()
        
        si = open(self.stdin,'r')
        so = open(self.stdout,'a+')
        se = open(self.stderr,'a+')

        os.dup2(si.fileno(),sys.stdin.fileno())
        os.dup2(so.fileno(),sys.stdout.fileno())
        os.dup2(se.fileno(),sys.stderr.fileno())
    
        # write pidfile
        atexit.register(self.delpid)

        pid = str(os.getpid())
        with open(self.pidfile,'w+') as f:
            f.write(pid + '\n')
    
    def delpid(self):
        os.remove(self.pidfile)

    def start(self):
        """Start the daemon."""

        # Check for a pidfile to see if the daemon already runs
        try:
            with open(self.pidfile,'r') as pf:

                pid = int(pf.read().strip())
        except IOError:
            pid = None
    
        if pid:
            message = "pidfile {0} already exist. " + \
                    "Daemon already running?\n"
            sys.stderr.write(message.format(self.pidfile))
            sys.exit(1)
        
        # Start the daemon
        self.daemonize()
        self.run()

    def stop(self):
        """Stop the daemon."""

        # Get the pid from the pidfile
        try:
            with open(self.pidfile,'r') as pf:
                pid = int(pf.read().strip())
        except IOError:
            pid = None
    
        if not pid:
            message = "pidfile {0} does not exist. " + \
                    "Daemon not running?\n"
            sys.stderr.write(message.format(self.pidfile))
            return # not an error in a restart

        # Try killing the daemon process    
        try:
            while 1:
                os.kill(pid,signal.SIGTERM)
                time.sleep(0.1)
        except OSError as err:
            e = str(err.args)
            if e.find("No such process") > 0:
                if os.path.exists(self.pidfile):
                    os.remove(self.pidfile)
            else:
                print (str(err.args))
                sys.exit(1)

    def restart(self):
        """Restart the daemon."""
        self.stop()
        self.start()

    def run(self):
        """You should override this method when you subclass Daemon.
        
        It will be called after the process has been daemonized by 
        start() or restart()."""

用于提取到Postgresql代码段:

def ingestion(str: csv_fullpath_ec2):
    psql_code = "\copy adobe_ls.adobe_livestream_stg(TABLE_COLUMNS) " \
              + "FROM '" + csv_fullpath_ec2 + "' " \
              + "WITH (FORMAT CSV,DELIMITER '" + delimiter + "',QUOTE E'\b');"
    cmd = ['/usr/bin/psql','-U',DB_USERNAME,'-h',HOST_NAME,'-p',PORT,'-d',DB_NAME,'-c','{}'.format(psql_code)]

    result = subprocess.run(cmd,env,stdout=subprocess.PIPE,stderr=subprocess.PIPE)

解决方法

暂无找到可以解决该程序问题的有效方法,小编努力寻找整理中!

如果你已经找到好的解决方法,欢迎将解决方案带上本链接一起发送给小编。

小编邮箱:dio#foxmail.com (将#修改为@)