We know that SEC makes company filings (e.g., 10-Ks, 10-Qs and 8-Ks) publicly available on EDGAR. The web search interface is convenient, but we may need to bulk download raw text filings. SEC provides an anonymous EDGAR FTP server to access raw text filings. Usually, if we know the path or URL to a file on an FTP server, we can easily use an Internet browser or an FTP client to connect to the server and download the file.

For example, if we navigate a bit on the EDGAR FTP server, we can find the path to the file “master.idx”. Copying that path into an Internet browser or an FTP client lets us download the file directly.

In the above example, we can find the path to “master.idx” by navigating on the EDGAR FTP server, but we cannot find any path to any raw text filing. In other words, paths to raw text filings are not visible by simply looking into the EDGAR FTP server. The index files fill this gap: each record in an index file carries the path to one filing, which is why the script below downloads the index files first.

The technical details may be too boring for most people, so I provide multiple downloadable Stata datasets that include all index files from 1993 Q1 to October 6, 2018.

    # Generate the list of index files archived in EDGAR since start_year (earliest: 1993) until the most recent quarter
    import datetime

    # Please download index files chunk by chunk. For example, please first download index files during 1993-2000, then
    # download index files during 2001-2005 by changing the following two lines repeatedly, and so on. If you need index
    # files up to the most recent year and quarter, comment out the following three lines, remove the comment sign at
    # the start of the next three lines, and define the start_year that immediately follows the ending year of the
    # previous chunk.

    start_year = 2011      # change start_year and current_year to re-define the chunk
    current_year = 2015    # change start_year and current_year to re-define the chunk
    current_quarter = 4    # do not change this line

    # start_year = 2016    # only change this line to download the most recent chunk
    # current_year = datetime.date.today().year
    # current_quarter = (datetime.date.today().month - 1) // 3 + 1

    years = list(range(start_year, current_year))
    quarters = ['QTR1', 'QTR2', 'QTR3', 'QTR4']
    history = [(y, q) for y in years for q in quarters]
    for i in range(1, current_quarter + 1):
        history.append((current_year, 'QTR%d' % i))
    urls = ['https://www.sec.gov/Archives/edgar/full-index/%d/%s/master.idx' % (y, q) for y, q in history]
    urls.sort()

    # Download index files and write content into SQLite
    import sqlite3
    import requests

    # SEC asks automated clients to identify themselves; this value is a placeholder to replace
    headers = {'User-Agent': 'Your Name your.email@example.com'}

    con = sqlite3.connect('edgar_idx.db')
    cur = con.cursor()
    # IF NOT EXISTS lets successive chunks accumulate in the same table
    cur.execute('CREATE TABLE IF NOT EXISTS idx (cik TEXT, conm TEXT, type TEXT, date TEXT, path TEXT)')

    for url in urls:
        lines = requests.get(url, headers=headers).content.decode('utf-8', 'ignore').splitlines()
        # Keep only data rows: five pipe-delimited fields with a numeric CIK (this skips the file header)
        records = [tuple(line.split('|')) for line in lines
                   if line.count('|') == 4 and line.split('|')[0].isdigit()]
        cur.executemany('INSERT INTO idx VALUES (?, ?, ?, ?, ?)', records)
        print(url, 'downloaded and wrote to SQLite')

    con.commit()
    con.close()

    # Write the SQLite table to a Stata dataset
    import pandas
    from sqlalchemy import create_engine

    engine = create_engine('sqlite:///edgar_idx.db')
    with engine.connect() as conn, conn.begin():
        data = pandas.read_sql_table('idx', conn)
        data.to_stata('edgar_idx.dta', version=117)
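Once the index records are available, the URL to a raw text filing can be assembled from the path stored in each record. The following is a minimal sketch, assuming the pipe-delimited master.idx layout (CIK, company name, form type, filing date, file name) and that paths are relative to https://www.sec.gov/Archives/. The names `IndexRecord`, `parse_index_line` and `filing_url` are hypothetical helpers, and the sample line is made up for illustration.

```python
from typing import NamedTuple, Optional

# Root under which EDGAR serves archived files over HTTPS
ARCHIVE_ROOT = "https://www.sec.gov/Archives/"


class IndexRecord(NamedTuple):
    """One data row of master.idx (hypothetical container for this sketch)."""
    cik: str
    conm: str
    form_type: str
    date_filed: str
    path: str


def parse_index_line(line: str) -> Optional[IndexRecord]:
    """Return a record for a data row, or None for header and separator lines."""
    fields = line.strip().split("|")
    if len(fields) != 5 or not fields[0].isdigit():
        return None
    return IndexRecord(*fields)


def filing_url(record: IndexRecord) -> str:
    """Build the URL to the raw text filing from the record's relative path."""
    return ARCHIVE_ROOT + record.path


# Made-up sample row, not a real filing
sample = "1234567|EXAMPLE CORP|10-K|2015-03-02|edgar/data/1234567/0001234567-15-000001.txt"
record = parse_index_line(sample)
if record is not None:
    print(record.form_type, filing_url(record))
```

Rows read back from the idx table in SQLite can be passed through `filing_url` in the same way, since the path column holds the same relative path.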
SEC closed the FTP server permanently in December and started to use a more secure transmission protocol, HTTPS. As a result, the description of the FTP server in the original post no longer applies (but the basic idea about the URLs to raw text filings remains unchanged). Since then I have received several requests to update the script.

This post, together with its sibling post “Part II”, has been my most-viewed post since I created this website. However, the landscape of 10-K/Q filings has changed dramatically over the past decade, and text-format filings are extremely unfriendly for researchers nowadays. I would suggest directing our research efforts to HTML-format filings with the help of BeautifulSoup.

As I acknowledged in the very first edition of this post, I borrowed some code from Edouard Swiac’s Python module “python-edgar” (version: 1.0). Edouard kindly informed me that he had updated his module (see his GitHub page). The major updates to his module are that (1) he migrated the file download from FTP to HTTPS and (2) he added parallel downloads, so it is now faster to rebuild the full index, especially if going all the way back to 1993. My initial thought about his updated module is that it provides more flexibility and should be more robust than mine.
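To illustrate why parallel downloads speed up rebuilding the full index, here is a minimal sketch using the thread pool in Python's standard library. It is not Edouard's implementation. The names `quarterly_index_urls`, `fetch_all` and `fake_fetch`, and the choice of four workers, are assumptions made for this example, and the fetch function is a stand-in so that the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def quarterly_index_urls(start_year: int, end_year: int) -> List[str]:
    """List the master.idx URL of every quarter from start_year to end_year inclusive."""
    return [
        f"https://www.sec.gov/Archives/edgar/full-index/{year}/QTR{quarter}/master.idx"
        for year in range(start_year, end_year + 1)
        for quarter in range(1, 5)
    ]


def fetch_all(urls: List[str], fetch: Callable[[str], str], max_workers: int = 4) -> Dict[str, str]:
    """Call fetch on every URL using a pool of threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        bodies = list(pool.map(fetch, urls))
    return dict(zip(urls, bodies))


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTPS download (hypothetical, for illustration only)."""
    return f"contents of {url}"


urls = quarterly_index_urls(1993, 1994)
results = fetch_all(urls, fake_fetch)
print(len(results), "index files fetched")
```

In real use, `fake_fetch` would be replaced by a function that downloads the URL over HTTPS, and the number of workers should stay small so that the request rate remains within SEC's fair-access limits.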