CSV — Process Local & Remote Files in Python

This article presents an open-source tool that is able to process local and remote CSV files, load and print the information, and later map the column to Django Types. Processing CSV files is usually required when the dataset becomes large, custom reports are unsupported by Excel or full data manipulation via data tables, and API is needed.
The current list of features can be further extended to map CSV files to database tables/models and fully generate dashboard web apps.

Source Code: CSV Processor part of the AppSeed Service (open-source)

Before we start explaining the code and the usage, let's summarize the tool features:

  • load local and remote files
  • print values
  • print detected column types
  • print the mapping types to a Django Model

The CSV parser can be executed via the CLI after cloning the project sources and making it usable as explained in the README. Once the installation is completed, we can call the CVS processor using this one-liner:

$ python manage.py tool_inspect_source -f media/tool_inspect/csv_inspect.json

The tool performs the following tasks:

  • validate the input
  • locate the CSV file (exit with error if not found)
  • loads the information and detects the column types
  • detects the Django column type
  • print the first 10 rows

The same can be applied to local and remote files. For instance, we can analyze the notorious Titanic.cvs by running this one-liner:

$ python manage.py tool_inspect_source -f media/tool_inspect/csv_inspect_distant.json

# Output
> Processing .\media\tool_inspect\csv_inspect_distant.json
    |-- file: https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
    |-- type: csv


Field        CSV Type    Django Types
-----------  ----------  ------------------------------------------
PassengerId  int64       models.IntegerField(blank=True, null=True)
Survived     int64       models.IntegerField(blank=True, null=True)
Pclass       int64       models.IntegerField(blank=True, null=True)
Name         object      models.TextField(blank=True, null=True)
Sex          object      models.TextField(blank=True, null=True)
Age          float64     models.FloatField(blank=True, null=True)
SibSp        int64       models.IntegerField(blank=True, null=True)
Parch        int64       models.IntegerField(blank=True, null=True)
Ticket       object      models.TextField(blank=True, null=True)
Fare         float64     models.FloatField(blank=True, null=True)
Cabin        object      models.TextField(blank=True, null=True)
Embarked     object      models.TextField(blank=True, null=True)


[1] - PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
[2] - 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
[3] - 2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
[4] - 3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
[5] - 4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
[6] - 5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
[7] - 6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
[8] - 7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
[9] - 8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
[10] - 9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
... (truncated output)  

Python CSV Processor - Execution and Output


Here are the relevant parts of the tool:

Loads the information and prior checks the source if is local or remote
print( '> Processing ' + ARG_JSON )
    print( '    |-- file: ' + JSON_DATA['source'] )
    print( '    |-- type: ' + JSON_DATA['type'  ] )
    print( '\n')

    tmp_file_path = None 

    if 'http' in JSON_DATA['source']:
        url = JSON_DATA['source']
        r = requests.get(url)
        tmp_file = h_random_ascii( 8 ) + '.csv'
        tmp_file_path = os.path.join( DIR_TMP, tmp_file )
        if not file_write(tmp_file_path, r.text ):
            return
        JSON_DATA['source'] = tmp_file_path
    else:    
        if not file_exists( JSON_DATA['source'] ):
            print( ' > Err loading SOURCE: ' + JSON_DATA['source'] )            
            return

    csv_types = parse_csv( JSON_DATA['source'] )

Python CSV Processor - Loads the information


Analyze the headers and map the detected types to Django Types.

For the tabular view, Tabulate Library is used:

    csv_types = parse_csv( JSON_DATA['source'] )

    #pprint.pp ( csv_types )

    table_headers = ['Field', 'CSV Type', 'Django Types']
    table_rows    = []

    for t in csv_types:
        t_type        = csv_types[t]['type']
        t_type_django = django_fields[ t_type ]
        table_rows.append( [t, t_type, t_type_django] )

    print(tabulate(table_rows, table_headers))

Python CSV Processor - Scan column types


The last step is to Print the CSV data:
    csv_data = load_csv_data( JSON_DATA['source'] )

    idx = 0
    for l in csv_data:
        idx += 1
        print( '['+str(idx)+'] - ' + str(l) )  

        # Truncate output ..
        if idx == 10:
            print( ' ... (truncated output) ' ) 
            break 

Python CSV Processor - print data

At this point, the code provides us access to the CSV information, data types, and the correspondent data types for Django. The mapping can be easily extended for any framework like Flask, Express, or NextJS.

The type mapping for Django is this one:

# Pandas Type
django_fields = {
    'int'           : 'models.IntegerField(blank=True, null=True)',
    'integer'       : 'models.IntegerField(blank=True, null=True)',
    'string'        : "models.TextField(blank=True, null=True)",
    'string_unique' : "models.TextField(blank=True, null=False, unique=True)",
    'object'        : "models.TextField(blank=True, null=True)",
    'object_unique' : "models.TextField(blank=True, null=False, unique=True)",
    'int64'         : 'models.IntegerField(blank=True, null=True)',
    'float64'       : 'models.FloatField(blank=True, null=True)',
    'bool'          : 'models.BooleanField(null=True)',
}

Python CSV Processor - Data mapping to Django


This tool is under active development and here are the next steps:

  • connect the tool to more data sources like remote/local databases (SQLite, MySql, PgSQL), JSON
  • Generate Models for any framework: FastAPI, Flask, Express, NextJS
  • Generate secure APIs on top
  • Generate server-side paginated DataTables using Tailwind/Bootstrap for styling

Thanks for reading!

For those interested in contributing, feel free to join the new AppSeed platform and connect with the community on Discord: