
Mapping repository dependencies with GitHub CLI and Python


I recently transitioned into my new role as an application security engineer and immediately dove into a Shift Left effort. My goal for the first quarter is to get static analysis running on every developer’s local system using Semgrep. I had hit the ground running and was making progress when, as per usual in infosec, an urgent request came in for a potential client.


The request was simple enough, something along the lines of “Legal needs a comprehensive list of all the open source dependencies our software uses. We’re trying to get the contract out within 30 days.” Cool, cool. So basically a promise was made to the potential client and we had to deliver this artifact, like, last week.


I figured a request like this was common enough, and we even have a tool called Cider that provides supply chain data. So, trying to be efficient, I thought: let’s just export the list of all the things, narrow it down to production-only repos, then de-duplicate. Done! But, as any senior engineer knows, if it’s that easy, something isn’t right. In this case, trying to export a CSV with 36k records made the request time out on Cider.


Ok, so that won’t work. I’ll just use the GitHub API, and I’ve been wanting to expand my golang chops. This should be a great opportunity for that!


Well, golang is kind of an asshole, and some syntactical details I’d glossed over in the Golang-for-dummies tutorials I’d read came back to bite me. I finally got something going, but there was a problem: my personal access token was only fetching my personal repos on GitHub, not the repos associated with my organization. Off I went on another adventure into the depths of GitHub documentation to figure out something I didn’t know I didn’t know, which turned out to be cool, because now our organization has PAT access hardened (another quick win).


TL;DR: I was finally able to fetch all the repos within my organization using https://pkg.go.dev/github.com/google/go-github/v50/github.
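

I wrote mine in Go with go-github, but for anyone who just wants to reproduce the repo listing, here’s a minimal sketch of the equivalent call against GitHub’s REST API in Python. The org name and token are placeholders, and (the lesson I learned above) the PAT has to be authorized for the organization, not just your personal account.


import requests

# Placeholders; the PAT must be authorized for the org,
# not just your personal account.
GITHUB_TOKEN = '<your-pat>'
ORG = '<your-org>'

repos = []
page = 1
while True:
    # GET /orgs/{org}/repos is paginated; 100 is the max page size
    resp = requests.get(
        f'https://api.github.com/orgs/{ORG}/repos',
        headers={'Authorization': f'Bearer {GITHUB_TOKEN}'},
        params={'per_page': 100, 'page': page},
    )
    resp.raise_for_status()
    batch = resp.json()
    if not batch:
        break
    repos.extend(repo['full_name'] for repo in batch)
    page += 1

print(f'[!] Fetched {len(repos)} repos')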


BUT another problem — there’s no f’ing way to fetch a repository’s dependency map via GitHub’s RESTful API :facepalm:. To do that, you have to use the GraphQL API, which was yet ANOTHER language I had to pick up. Truth be told, I gave up on this effort after about three hours of head bashing and much colorful language.
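

In case it saves someone else the head bashing: here’s a minimal sketch of the kind of GraphQL query I was fighting with, sent through Python’s requests instead of Go. The owner/name values are placeholders, and depending on your GitHub version the dependency graph fields may still sit behind the preview Accept header shown below.


import requests

GITHUB_TOKEN = '<your-pat>'  # placeholder

# dependencyGraphManifests is only exposed via GraphQL, not REST
QUERY = """
query ($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    dependencyGraphManifests(first: 10) {
      nodes {
        filename
        dependencies(first: 100) {
          nodes { packageName requirements packageManager }
        }
      }
    }
  }
}
"""

resp = requests.post(
    'https://api.github.com/graphql',
    json={'query': QUERY,
          'variables': {'owner': '<your-org>', 'name': '<your-repo>'}},
    headers={
        'Authorization': f'bearer {GITHUB_TOKEN}',
        # the dependency graph was a schema preview at one point; this
        # header may or may not be required on your GitHub version
        'Accept': 'application/vnd.github.hawkgirl-preview+json',
    },
)
resp.raise_for_status()
print(resp.json())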


Then something amazing happened… a fellow engineer cough thanks @spaceB0xx cough sent me this link https://github.com/andyfeller/gh-dependency-report and said “maybe this will help.” A quick review of the code sparked my interest and I went to bed hopeful. The next morning I got up, logged on, and ran:


gh extension install andyfeller/gh-dependency-report
gh dependency-report Organization backend-service


A REPORT WAS GENERATED! I vengefully deleted the three directories containing my various attempts at doing what this extension accomplished in about 30 seconds, and proceeded to run:


gh dependency-report Organization
^C


I Ctrl-C’d as soon as the massive list of 600+ repositories appeared, as we have a ton of dead repos, internal tooling repos, playgrounds, etc. I only needed ~170 of these repositories in the resulting report, so after cleaning up that list I was able to run the extension against the admittedly-still-massive list of repos and go for a walk. When I returned, we finally had a 25MB file containing a list of all production-impacting dependencies, and I was on to the easy part: de-duplicate the list, break it down into separate files containing language-specific dependencies, remove internal dependencies, and produce one last file listing the unique licenses across those dependencies.


I won’t cover what this code does line by line, but after a couple hours of hacking this thing together we have our deliverable in hand, and tooling to make sure that when the next request of this kind comes in we aren’t scrambling like we were this time around.


import csv
import pandas as pd

FILE_NAMES = {
    'PIP': 'python_dependencies',
    'NPM': 'javascript_dependencies',
    'COMPOSER': 'php_dependencies',
    'NUGET': 'dotnet_dependencies',
    'ACTIONS': 'github_dependencies',
    'RUBYGEMS': 'ruby_dependencies'
}

with open('full_dependency_list-prod-repos.csv') as file_in:
    print('[!] Processing report....')
    csv_in = csv.DictReader(file_in)
    keep_fields = ['Dependency', 'Version', 'License Type', 'License URL']
    outputs = {}

    licenses = set()

    for row in csv_in:
        pkg_manager = row['Package Manager']
        # open new file and write the header
        if pkg_manager not in outputs:
            # fall back to the raw package manager name if there's no mapping
            file_name = FILE_NAMES.get(pkg_manager, pkg_manager)

            print(f'[!] Creating {file_name}.csv....')
            file_out = open(f'{file_name}.csv', 'w', newline='')
            dw = csv.DictWriter(file_out, fieldnames=keep_fields)
            dw.writeheader()
            outputs[pkg_manager] = file_out, dw

        # fetch only the fields we want
        fields_to_write = {
            'Dependency': row.get('Dependency'),
            'Version': row.get('Requirements'),
            'License Type': row.get('License'),
            'License URL': row.get('License Url')
        }

        # write the row
        outputs[pkg_manager][1].writerow(fields_to_write)
        # store as a tuple; license names can contain spaces, so joining
        # and re-splitting on ' ' later would mangle them
        licenses.add((row.get('License'), row.get('License Url')))

    # close all files
    for file_out, _ in outputs.values():
        print(f'[!] Closing {file_out.name}...')
        file_out.close()

    # generate licenses file
    license_out = open('unique_licenses.csv', 'w', newline='')
    license_header = ['License Type', 'License URL']
    dw = csv.DictWriter(license_out, fieldnames=license_header)
    dw.writeheader()

    for license_type, license_url in licenses:
        dw.writerow({
            'License Type': license_type,
            'License URL': license_url
        })

    license_out.close()

# de-duplicate every file we actually wrote, including any package
# managers that weren't in the FILE_NAMES mapping
for file_out, _ in outputs.values():
    print(f'[!] De-duplicating {file_out.name}....')
    df = pd.read_csv(file_out.name)
    df.drop_duplicates(inplace=True)
    df.to_csv(file_out.name, index=False)

print('[!] Done!')


Hopefully this helps someone else out there faced with a similar problem. Thanks for reading.