Exporting millions of rows via CSV
13 Feb 2015

Exporting CSV seems like a solved problem in the Ruby on Rails framework, and to a large extent it is. There are plenty of tutorials available on how to implement an efficient CSV export with support for HTTP streaming. Still, there are two issues with implementing a CSV export in Rails:
- It takes a considerable amount of time
- It takes a considerable amount of memory (if one line of the CSV represents multiple models)
The solution to both of these problems is to do the CSV export in the database and let Rails only serve the response. We'll be using PostgreSQL's support for importing and exporting CSV.
If you want to use the code discussed here, the base modules are available here.
Controller
Let's start by implementing response streaming the same way we would if we used Rails to construct the CSV response.
class CsvExportController < ApplicationController
  def index
    respond_to do |format|
      format.csv { render_csv }
    end
  end

  private

  def selected_items
    # implementation omitted
  end

  def csv_lines
    Items::CsvExport.new(selected_items)
  end

  def render_csv
    set_file_headers
    set_streaming_headers
    response.status = 200
    self.response_body = csv_lines
  end

  def set_file_headers
    headers['Content-Type'] = 'text/csv; charset=UTF-16LE'
    headers['Content-Disposition'] = "attachment; filename=\"#{file_name}.csv\""
  end

  def set_streaming_headers
    headers['X-Accel-Buffering'] = 'no'
    headers['Cache-Control'] ||= 'no-cache'
    headers.delete('Content-Length')
  end

  def file_name
    'a_big_export'
  end
end
There is a lot going on here, but we won't go into details, because other posts cover the topic. The most important thing to grasp is that the controller sets the response body to the return value of the csv_lines method, which is an object that implements the Enumerable interface. The controller will iterate over that object, streaming each row it yields.
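To make the mechanism concrete, here is a minimal sketch (the TrivialCsvBody class and its rows are invented for illustration). Rack calls #each on the response body, so any Enumerable whose #each yields strings is streamed to the client chunk by chunk:

class TrivialCsvBody
  include Enumerable

  # Rack calls #each on the response body; every yielded string
  # is written to the client as a separate chunk.
  def each
    yield "a,b\n"
    1_000.times { |i| yield "#{i},#{i * 2}\n" }
  end
end

# In a controller action:
# self.response_body = TrivialCsvBody.new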
Let's have a look at the Exporter class.
Exporter
require 'csv'

module Items
  class CsvExport
    include CsvBase::CsvBase

    def initialize(items)
      @items = items
    end

    private

    attr_reader :items

    def header
      CSV.generate_line(['a column', 'another column'], col_sep: "\t").to_s
    end

    def export_columns
      ['items.a_column', 'related_class.another_column']
    end

    # to_sql turns the relation into a plain SQL string, which the
    # COPY statement in CsvBase interpolates directly.
    def export_sql
      items.select(export_columns).to_sql
    end
  end
end
This class defines the particulars of how to get the data for the export. It receives an ActiveRecord relation, which it uses to construct the export SQL query (note the to_sql call: the COPY statement in CsvBase needs a plain SQL string, not a relation object). The heavy lifting is done by the CsvBase::CsvBase module.
Keep in mind that the implementation of the exporter class depends on the application; this example is provided only as a template of what an implementation can look like.
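For illustration, selected_items in the controller could look something like this (the Item model, the related_class association, and the date scope are all assumptions; any relation that joins the tables referenced by export_columns will do):

# Hypothetical relation; adjust the joins and conditions to the schema.
def selected_items
  Item.joins(:related_class).where(created_at: 1.year.ago..Time.current)
end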
CsvBase
The CsvBase::CsvBase module is designed to be included in a class. It uses methods defined on that class in its #each template method to feed the enumerator with data, and is itself application independent. Currently the assumption is that the exported CSV is going to be opened by Excel on Windows. It is, however, trivial to abstract that assumption away.
module CsvBase
  module CsvBase
    BOM = "\377\376".force_encoding('UTF-16LE')

    include Enumerable

    def each
      yield bom
      yield encoded(header)
      generate_csv do |row|
        yield encoded(row)
      end
    end

    def header
      ''
    end

    def bom
      ::CsvBase::CsvBase::BOM
    end

    private

    def encoded(string)
      string.encode('UTF-16LE', undef: :replace)
    end

    # WARNING: This will most likely NOT work on JRuby!!!
    def generate_csv
      conn = ActiveRecord::Base.connection.raw_connection
      conn.copy_data(export_csv_query) do
        while row = conn.get_copy_data
          yield row.force_encoding('UTF-8')
        end
      end
    end

    def export_csv_query
      %Q{copy (#{export_sql}) to stdout with (FORMAT CSV, DELIMITER '\t', HEADER FALSE, ENCODING 'UTF-8');}
    end
  end
end
Since the CsvBase::CsvBase module includes the Enumerable module, its each method will end up being called by the controller. It consists of three steps:
- It first yields a BOM (byte order mark), which is important for opening the CSV in Excel on Windows. It can be overridden in the client CsvExport class if it's not needed or something different is required (see the sketch after this list).
- It yields a properly encoded header row. The client provides the header.
- It calls the generate_csv method, yielding whatever that method yields.
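As an example of the first point, a client exporter that does not target Excel could suppress the byte order mark by overriding #bom. A minimal sketch:

module Items
  class CsvExport
    include CsvBase::CsvBase

    # No BOM: the file is not destined for Excel on Windows.
    def bom
      ''
    end
  end
end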
The generate_csv method is the interesting part. It uses the [low-level API](//deveiate.org/code/pg/PG/Connection.html) of the postgres connector. The connector receives a [COPY command](//www.postgresql.org/docs/9.2/static/sql-copy.html) with an SQL query constructed to select all rows, using a custom delimiter. Since each row returned from PostgreSQL is already a CSV string, all Ruby has to do is properly encode the string and pass it along.
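Stripped of the Rails plumbing, the same mechanism can be demonstrated with the pg gem directly (a sketch; the database name and table are assumptions):

require 'pg'

# Hypothetical database and table, for illustration only.
conn = PG.connect(dbname: 'my_app_development')

# COPY ... TO STDOUT hands back one CSV-formatted row per get_copy_data
# call, so memory use stays flat no matter how many rows the query matches.
conn.copy_data("COPY (SELECT id, name FROM items) TO STDOUT WITH (FORMAT CSV)") do
  while (row = conn.get_copy_data)
    print row
  end
end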
That is all there is to it. The download will now be super-fast and consume a constant amount of memory while generating the output.
For comparison, let's create a new exporter that uses the standard find_each approach.
require 'csv'

module Items
  class CsvExport
    include CsvBase::CsvBase

    def initialize(items)
      @items = items
    end

    private

    attr_reader :items

    def header
      CSV.generate_line(['a column', 'another column'], col_sep: "\t").to_s
    end

    # Overrides the COPY-based implementation; rows are built in Ruby
    # and must be yielded back to #each for streaming.
    def generate_csv
      items.find_each do |item|
        yield CSV.generate_line([item.a_column, item.association.another_column], col_sep: "\t").to_s
      end
    end
  end
end
Exporting 500k records:
- find_each approach
  - Time: ~450s
  - Peak memory: ~1.8GB
- raw_connection approach
  - Time: ~90s
  - Peak memory: ~400MB
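For reference, a rough harness along these lines could produce such numbers (an assumption: the post does not describe its measurement setup, and reading peak RSS from /proc makes this Linux-only):

require 'benchmark'

elapsed = Benchmark.realtime do
  # Drain the enumerator the same way the streaming controller would.
  Items::CsvExport.new(Item.all).each { |row| row }
end

# VmHWM is the process's peak resident set size, in kB.
peak_kb = File.read('/proc/self/status')[/VmHWM:\s+(\d+)/, 1].to_i
puts "Time: #{elapsed.round}s, peak memory: #{peak_kb / 1024}MB"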
Conclusion
When the middleman is cut out, there is an observable ~5x improvement in both speed and memory consumption, without any change in behavior. Starting out with this kind of solution is nevertheless not recommended, because there are a few caveats to consider:
- We are tying ourselves to PostgreSQL by relying on details of the connection. That is not a problem in most apps, but it is something to consider.
- Problems on non-MRI environments: the raw_connection code will most likely cause issues on JRuby.
- The code is more complicated. The example is factored so that the details of how to build the content of the CSV are encapsulated in a single method, but it's still an additional burden on the developers maintaining that code.
It is therefore preferable to start with a simple approach and only switch to this kind of solution when Rails becomes a real bottleneck.