The best technique I have found is to defer database writes by writing to a temporary text file in memory (using php://memory) and then using LOAD DATA INFILE to write the contents to the database. That puts all the email data into a database queue, allowing a cron task to send the emails at a steady rate, say 50 per minute.
Using this technique I can send 50,000+ emails quite easily. The queue function takes under a minute to run, and the cron job hacks away at the queue for about 24 hours.
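As a rough sketch of the cron side of this (not the actual production code), a worker run once a minute could pull a batch from the queue, send it, and mark each row as sent. The `email_queue` table, its columns, and the `sendQueuedBatch()` function are hypothetical names chosen for illustration; the sender is passed in as a callable so the batch logic does not depend on any particular mail library.

```php
<?php
// Hypothetical queue worker, intended to be run by cron once a minute.
// Assumes a table email_queue(id, recipient, subject, body, sent_at);
// this schema is an assumption for the sketch, not taken from the post.
function sendQueuedBatch(PDO $db, callable $send, $batchSize = 50)
{
    // Fetch the oldest unsent rows, at most $batchSize of them.
    $select = $db->prepare(
        'SELECT id, recipient, subject, body FROM email_queue '
        . 'WHERE sent_at IS NULL ORDER BY id LIMIT ' . (int) $batchSize
    );
    $select->execute();
    $rows = $select->fetchAll(PDO::FETCH_ASSOC);

    $mark = $db->prepare('UPDATE email_queue SET sent_at = :now WHERE id = :id');
    $sent = 0;
    foreach ($rows as $row) {
        // Only mark the row as sent if the sender reports success,
        // so failed rows are retried on the next run.
        if ($send($row['recipient'], $row['subject'], $row['body'])) {
            $mark->execute(array(
                ':now' => date('Y-m-d H:i:s'),
                ':id'  => $row['id'],
            ));
            $sent++;
        }
    }
    return $sent;
}
```

With a PDO connection in `$db`, something like `sendQueuedBatch($db, 'mail')` would use PHP's built-in mail() as the sender. The worker only ever holds one batch in memory, so its footprint stays flat however long the queue is.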
The problem I have with this technique is that it evolved by whittling away at the various framework layers used in my "enterprise applications", layers that were supposedly designed to provide scalability.
Now you can probably tell from this blog that I use Symfony extensively (although I have dabbled with other frameworks, I always come back to Symfony). Perhaps that is because my largest projects (over 1,000,000 lines of code) are locked into Symfony 1.0 and, more importantly, Propel 1.2. Arguably, what I should be doing is writing a Symfony plugin that performs large data processing using the techniques I've described, but it seems a shame to have a detailed, documented and unit-tested ORM that can only be used for serving small jobs like "web pages", and must be bypassed for any data-intensive stuff. Isn't handling data-intensive work a core requirement of "enterprise level" scalability?
Anyway, I'll try to end the rant with some code. Last night I wrote a custom YAML parser because the sfYaml::load() method in Symfony 1.0 (which uses spyc) was taking over 20 minutes to load a 1MB YAML file. I managed to load it in about 5 seconds. Now, I'll admit it's not really a YAML parser, more of a key-value pair extractor, but because it doesn't use any regex it is way faster than the old sfYaml.
(NB: I couldn't install syck on my CentOS 4 server, and the newer sfYaml component from 1.2 failed to load YAML collections in the format shown below.)
The YAML (multiply this by 10,000):
Recipients:
  - title: 'Mr'
    firstname: 'Bob'
    lastname: 'Dobbs'
    company: 'Subgenius Network'
    email: 'bob@subgenius.com'
  - title: 'Mrs'
    firstname: 'Jane'
    lastname: 'Dobbs'
    company: 'Subgenius Network'
    email: 'jane@subgenius.com'
The recipient extractor method:
public static function extractRecipients($yaml)
{
    //separate recipients from yaml
    //(assumes the "bcc:" key directly follows the Recipients block)
    $recipients_string = "Recipients:";
    $recipients_start = strpos($yaml, $recipients_string);
    $recipients_end = strpos($yaml, "bcc:");
    $recipients_yaml = substr($yaml, $recipients_start,
        ($recipients_end - $recipients_start));
    $yaml_start = substr($yaml, 0, $recipients_start);
    $yaml_end = substr($yaml, $recipients_end);
    $yaml = $yaml_start . $yaml_end;
    //load the remaining (small) yaml with the regular parser
    $yaml_array = sfYaml::load($yaml);
    //get recipients array
    $recipients_yaml = str_replace($recipients_string, "",
        $recipients_yaml);
    //explode() is a plain string split; split() would go through the regex engine
    //(limitation: a "-" inside a value, e.g. a hyphenated name, also splits here)
    $recipients_array = explode("-", $recipients_yaml);
    //loop over recipients to get details as array
    $recipients_list = array();
    foreach ($recipients_array as $r1) {
        //loop over each line item; trim() below removes any trailing "\r"
        $r1_array = explode("\n", $r1);
        $my_recipient = array();
        foreach ($r1_array as $r2) {
            if (strpos($r2, ":") !== false) {
                //extract key pair from line item; the limit of 2 keeps
                //values that themselves contain a ":" in one piece
                $r2_array = explode(":", $r2, 2);
                if (count($r2_array) == 2) {
                    $key = str_replace("''", "'",
                        trim(trim($r2_array[0]), "'"));
                    $value = str_replace("''", "'",
                        trim(trim($r2_array[1]), "'"));
                    $my_recipient[$key] = $value;
                }
            }
        }
        if (count($my_recipient) > 1) {
            //add this recipient to list
            $recipients_list[] = $my_recipient;
        }
    }
    //add recipients to yaml data
    $yaml_array[trim($recipients_string, ":")] = $recipients_list;
    //return yaml data
    return $yaml_array;
}
Yes, you do need a physical file for LOAD DATA INFILE. I use php://memory *inside* the large loop. Then, after the loop finishes, I dump the memory to a file and load it from there. The reason I use memory in the loop is that it is faster than writing to disk, and the less time spent in the loop the better!
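A minimal sketch of that buffer-then-dump step is below. The function names, the tab-separated layout, and the table and column names in the usage note are assumptions for illustration only; the escaping follows MySQL's default LOAD DATA format (tab between fields, newline between rows, backslash as the escape character).

```php
<?php
// Buffer rows in php://memory inside the loop, then dump once to a real file.
// bufferRowsToFile() and buildLoadStatement() are hypothetical helper names.
function bufferRowsToFile(array $rows, $path)
{
    $buffer = fopen('php://memory', 'w+');
    foreach ($rows as $row) {
        $fields = array();
        foreach ($row as $field) {
            // Escape backslashes first, then tabs and newlines, so the
            // backslashes added for "\t" and "\n" are not escaped again.
            $fields[] = str_replace(
                array("\\", "\t", "\n"),
                array("\\\\", "\\t", "\\n"),
                (string) $field
            );
        }
        fwrite($buffer, implode("\t", $fields) . "\n");
    }
    // One sequential copy from memory to disk after the loop has finished.
    rewind($buffer);
    $out = fopen($path, 'w');
    $bytes = stream_copy_to_stream($buffer, $out);
    fclose($out);
    fclose($buffer);
    return $bytes;
}

// Build the statement that loads the dumped file. $table and $columns are
// trusted identifiers here; do not pass user input into them.
function buildLoadStatement($path, $table, array $columns)
{
    return sprintf(
        "LOAD DATA INFILE '%s' INTO TABLE %s (%s)",
        addslashes($path),
        $table,
        implode(', ', $columns)
    );
}
```

For example, `buildLoadStatement('/tmp/queue.txt', 'email_queue', array('recipient', 'subject'))` produces a statement that can be passed to the database connection. Plain LOAD DATA INFILE reads the file on the database server, so the path has to be readable by MySQL itself; if the web server and database live on different machines, the LOCAL variant is probably what is needed.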
You mention that you use the php://memory and then get it into the DB using LOAD DATA INFILE. How exactly does this work? Doesn't LOAD DATA INFILE require a physical file?
Good point, and you're absolutely right. YAML is great for configuration files but not so great for large chunks of data.
The reason YAML is used in this app is that the data comes out of an MS Access system. We are using it as a lightweight replacement for XML and, until recently, it has been sensational.
The problem arose when the client decided that the same process should be used for sending bulk emails, something the system was not designed for and, consequently, cannot handle.
In trying to adapt the system it has become clear that:
* Propel is too memory intensive to be used on large numbers of records
* YAML is too slow for large data sets
So, once I've decided how to solve this little problem, I'll post it here!
I'm curious as to why you would store data like this in a yml file and not a database. It seems that even a sqlite database would be far superior to a text file for an amount of data like this.