Welcome to TheGillis.net

Consider this site a collection of random notes about a variety of topics. I hope this information helps you in some way.

17 November 2005 - 23:12

Google Sitemap Large Log File Fix

Today I was asked to take a look at an error in the Google Sitemap Tool. When Mike ran this Python script, he got a malloc error. After some quick searching of the code and some newsgroups, I narrowed the problem down to a loop that iterates over the log lines. This is NOT a Python error, as some in the newsgroups have suggested. The fix is simple: alter the loop to process one line at a time rather than loading the entire file into memory. If you're only interested in the patch, its location and simple instructions for applying it are at the end of this post. Read on for the detailed information.

Problem

The error Mike got while attempting to parse a 1 GB Apache log file was:

Reading configuration file: fpux_config.xml
Walking DIRECTORY "/home/lxcoda/fpux.com/html/"
Opened ACCESSLOG file: /extra/logs/lxcoda/fpux.com/fpux.com-access.log
Traceback (most recent call last):
  File "sitemap_gen.py", line 2194, in ?
    sitemap.Generate()
  File "sitemap_gen.py", line 1775, in Generate
    input.ProduceURLs(self.ConsumeURL)
  File "sitemap_gen.py", line 1115, in ProduceURLs
    for line in file.readlines():
MemoryError

Because of that error, none of the log was being parsed.

Solution

A quick look at the code led me to line 1115 of sitemap_gen.py:

for line in file.readlines():
   ...

This offending line reads every line of the file into memory as an anonymous list. The for loop then cycles through these in-memory lines and parses each one. It can easily be changed to:

line = file.readline()
while line:
   ...
   line = file.readline()

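This solves the problem of handling large log files, because only one line is held in memory at a time. It should not cause any negative side effects: readline() returns an empty string only at end of file, so a blank line in the log (which still contains its newline) does not end the loop early.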
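As a rough, self-contained sketch of the same pattern outside of sitemap_gen.py, the loop can be wrapped in a small helper. The names process_lines, process_lines_iter and handle_line are made up for illustration and do not appear in the Sitemap tool. The second function shows the equivalent form for Python versions where file objects can be iterated directly, which also reads one line at a time.

```python
import io


def process_lines(stream, handle_line):
    """Feed each line of `stream` to `handle_line`, one line at a time.

    Hypothetical helper, not part of sitemap_gen.py.
    """
    count = 0
    line = stream.readline()
    while line:  # readline() returns '' only at end of file
        handle_line(line)
        count += 1
        line = stream.readline()
    return count


def process_lines_iter(stream, handle_line):
    """Same behavior, using iteration over the file object itself."""
    count = 0
    for line in stream:  # no readlines(): lines are produced lazily
        handle_line(line)
        count += 1
    return count


# Small in-memory stand-in for a log file, including a blank line.
sample = "GET /a\n\nGET /b\n"

seen = []
n = process_lines(io.StringIO(sample), seen.append)

seen_iter = []
n_iter = process_lines_iter(io.StringIO(sample), seen_iter.append)
```

Both helpers visit the same lines in the same order as the original readlines() loop, so the loop body does not need to change.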

Conclusion

This is a good example of where scalability issues can emerge and how they can often be corrected with only minor fixes. Under normal circumstances the original loop is completely correct; recognizing that it will not scale to very large inputs takes time and experience. In this example, before the patch Python used approximately 400 MB before failing with the memory error, and with the patch it uses approximately 15 MB throughout its execution.
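For anyone who wants to see the difference on their own machine, here is a minimal sketch that compares the peak memory of the two loop styles on a small generated file. It assumes a modern Python 3 with the standard tracemalloc module, which was not available when this post was written. The file contents and the helper names (peak_bytes, load_all, stream) are invented for the example, and the numbers it produces will not match the 400 MB and 15 MB figures above.

```python
import os
import tempfile
import tracemalloc


def peak_bytes(func, path):
    """Return the peak traced allocation (in bytes) while func(path) runs."""
    tracemalloc.start()
    try:
        func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def load_all(path):
    # Original style: readlines() builds a list holding every line.
    with open(path) as f:
        for line in f.readlines():
            pass


def stream(path):
    # Patched style: only the current line is kept.
    with open(path) as f:
        line = f.readline()
        while line:
            line = f.readline()


# Write a throwaway log-like file (made-up content, 50,000 lines).
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    for i in range(50000):
        tmp.write('127.0.0.1 - - "GET /page%d HTTP/1.0" 200 512\n' % i)
    log_path = tmp.name

peak_all = peak_bytes(load_all, log_path)
peak_stream = peak_bytes(stream, log_path)
os.remove(log_path)

print("readlines() peak: %d bytes" % peak_all)
print("readline()  peak: %d bytes" % peak_stream)
```

The readlines() version should report a peak that grows with the size of the file, while the readline() version should stay roughly constant.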

The patch can be found at http://www.thegillis.net/examples/misc/sitemap.large.fix.patch

To apply the patch, download it to the sitemap folder and run:

patch sitemap_gen.py sitemap.large.fix.patch
