Opening Up the Law: Pacer, CITP, and the RECAP the Law Project

Deven Desai

Deven Desai is an associate professor of law and ethics at the Scheller College of Business, Georgia Institute of Technology. He was also the first, and to date, only Academic Research Counsel at Google, Inc., and a Visiting Fellow at Princeton University’s Center for Information Technology Policy. He is a graduate of U.C. Berkeley and the Yale Law School. Professor Desai’s scholarship examines how business interests, new technology, and economic theories shape privacy and intellectual property law and where those arguments explain productivity or where they fail to capture society’s interest in the free flow of information and development. His work has appeared in leading law reviews and journals including the Georgetown Law Journal, Minnesota Law Review, Notre Dame Law Review, Wisconsin Law Review, and U.C. Davis Law Review.

7 Responses

  1. Sarah L. says:

    This is fantastic. Is there a way to browse the publicly available documents at the Internet Archive without logging onto Pacer, or a search engine for the documents that gets around having to log into Pacer?

  2. Harlan Yu says:

    Hi Sarah, the public repository can’t be searched or indexed by crawlers just yet, for privacy reasons. In the past, the courts haven’t been very good at enforcing their own rule requiring attorneys to submit redacted versions of their filings. They’re starting to be stricter about this, but there is still plenty of private data that could be mined from the existing documents.

    That said, if you know the court name and PACER case number, you can manually look at cases at the repository:

    http://www.archive.org/details/gov.uscourts.court.casenum

    In that URL, [court] stands for the short abbreviation of the court name from the ECF domain name (e.g. ‘cand’ for the Northern District of California) and [casenum] for the case number from PACER.
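
As a rough illustration of the URL pattern described above, here is a minimal Python sketch. The helper name `recap_item_url` and its validation rules are assumptions made for this example, not part of RECAP; only the `gov.uscourts.<court>.<casenum>` pattern comes from the comment.

```python
import re

# Base of the repository URL quoted in the comment above.
ARCHIVE_DETAILS_BASE = "http://www.archive.org/details"


def recap_item_url(court: str, case_num: str) -> str:
    """Build the repository URL for one case (hypothetical helper).

    court    -- short court abbreviation from the ECF domain name, e.g. "cand"
    case_num -- the PACER case number, given as a string of digits
    """
    court = court.strip().lower()
    case_num = case_num.strip()
    # Assumption: abbreviations are alphanumeric and case numbers are digits.
    if not re.fullmatch(r"[a-z0-9]+", court):
        raise ValueError(f"unexpected court abbreviation: {court!r}")
    if not re.fullmatch(r"\d+", case_num):
        raise ValueError(f"unexpected case number: {case_num!r}")
    return f"{ARCHIVE_DETAILS_BASE}/gov.uscourts.{court}.{case_num}"


url = recap_item_url("vid", "10330")
```

The `vid` / `10330` pair is the item identifier that appears later in this thread; any other court and case number would be substituted the same way.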

  3. Sarah L. says:

    Thank you–that’s very helpful. And let me just reiterate how incredibly great this project is.

  4. Harlan

    You state: “That said, if you know the court name and PACER case number”

    That is really not correct – you have not correctly described what RECAP does.

    The PACER (actually CM/ECF) case number that RECAP elected to use is not the docket number but rather a hidden unique number used by the CM/ECF database. One can find it only by inspecting the source code of the docket HTML page. This unique case number is obscure and has no meaning to most people. It is like identifying you by your Social Security number rather than your name.

    In addition, one would need to know the exact Docket Entry number for the document. So, your specification is not complete.

    People also need to know that judicial opinions marked as such by a judge are already completely free on PACER (CM/ECF), as long as one has registered for a user name. FREE.
    Example:
    https://ecf.nysd.uscourts.gov/doc1/12703000666.pdf

    Also, websupp.org has many district court cases obtained this way from the free opinions on CM/ECF. But they do something better than RECAP: they use a meaningful file name, they put all of the metadata into the properties of the PDF file, and the opinions are searchable on Google.

    RECAP would be much better if it used an understandable file name with the docket number and included the metadata from the docket sheet in the PDF file.

    That being said, the RECAP concept is brilliant and the programming is expert, but more needs to be done to make this effective and to persuade attorneys to sign on and offer up free documents.

    Alan Sugarman

  5. Harlan

    It occurred to me that you meant you would see a list of documents so I checked again to see if the directory was exposed, and it was.

    I now see that if one does happen to know the ECF case number, then one can see a list of all documents uploaded as to the case number – since the directory is exposed.

    Now having seen that, I checked out the files in this directory for a file I uploaded yesterday:
    http://ia311002.us.archive.org/1/items/gov.uscourts.vid.10330/

    I see now the metadata and the docket sheet that I looked at yesterday.

    First, the docket sheet HTML file leaves out all of the very valuable information at the start of the docket, including the type of case, the names of the attorneys, and the name of the judge. Most important, you leave out the coverage period for the docket report.

    I wonder why all of this was not captured.

    Second, it appears you collected, in part, the docket sheet report that I elected to ask for. The problem there is that one can ask for a docket sheet covering only certain dates; this is done in big cases. So I assume that when this is done, there will be an overwrite.

    Third, I looked at the “metadata” for the actual document uploaded. This is very limited and should be compared to the breadth of data on the written opinions report that CM/ECF provides (am I the only person who logs into ecf.nysd and sees CMECF at the top of the page? Why do people refer to this as PACER???). See how websupp.org does this. Much better. What you need to do is to fully populate the XML file and attach it to the PDF file. The metadata ought to go with the file. Sorry if this spoils your hash. Also, I see that you may want to attach the case metadata XML file to the PDF file as well.

    This is what you have for the metadata for this document:
    ETag: "6cd55dac216aa2d147c30312185db880"
    accept-encoding: identity
    authorization: LOW MtXL0tEgFmJcLXjr:REDACTED_BY_IA_S3
    connection: close
    content-length: 38327
    content-type: application/x-www-form-urlencoded
    host: s3.us.archive.org
    user-agent: Python-urllib/2.5
    x-archive-meta-attachment-num: 0
    x-archive-meta-available: 1
    x-archive-meta-collection: usfederalcourts
    x-archive-meta-court: vid
    x-archive-meta-doc-num: 370
    x-archive-meta-language: eng
    x-archive-meta-mediatype: texts
    x-archive-meta-neverindex: true
    x-archive-meta-noindex: true
    x-archive-meta-pacer-case-num: 10330
    x-archive-meta-pacer-doc-id: 1930210992
    x-archive-meta-sha1: 95b4d596913e6f653f8df2134f989b7a34f54fa7
    x-archive-meta-upload-date: 2009-08-15 11:43:45
    x-archive-queue-derive: 0
    x-upload-date: 2009-08-15T16:34:58.000Z
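
To make the listing above easier to work with, here is a small Python sketch that separates the `x-archive-meta-*` fields from the transport headers. The function name `parse_archive_meta` is made up for this example, and the sample text is a subset of the headers quoted above; this is not the code RECAP itself uses.

```python
ARCHIVE_META_PREFIX = "x-archive-meta-"


def parse_archive_meta(raw: str) -> dict:
    """Return only the x-archive-meta-* fields, keyed without the prefix."""
    meta = {}
    for line in raw.splitlines():
        # Split on the first colon only, so values such as timestamps
        # keep their own colons.
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key.startswith(ARCHIVE_META_PREFIX):
            meta[key[len(ARCHIVE_META_PREFIX):]] = value.strip()
    return meta


# A subset of the headers quoted in the comment above.
sample = """\
content-length: 38327
x-archive-meta-court: vid
x-archive-meta-doc-num: 370
x-archive-meta-pacer-case-num: 10330
x-archive-meta-upload-date: 2009-08-15 11:43:45
"""

fields = parse_archive_meta(sample)
```

With the fields in a dictionary, it is straightforward to check which descriptive entries are present for a document and which are absent.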

    Incredibly, in the metadata for this document, you omit the docket number of the case, one of the most important pieces of information. So one could not search the XML file and find the case by docket number! But you did include it in the docket XML file. It should be in both. I am assuming, of course, that the opinion files are ones that people will want to search on the internet at some point in the future. Oops: now I see you also left out the case name in the document XML, but it is in the docket HTML.

    For example, you do not even have the judge’s name. You should also include the name of the court in the metadata, even though it is also in the file data.

    Basically, you need to parse out each field you can identify in the CMECF database – and have a separate entry in the metadata. And, the metadata of a separate doc file should include the metadata for the case.
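
One way to picture this suggestion is a merge of case-level fields into each document's metadata. The Python sketch below is only an illustration: the field names `docket-num`, `case-name`, and `judge` are hypothetical, their values are placeholders rather than real case data, and nothing here reflects how RECAP actually stores metadata.

```python
def merge_case_into_document(doc_meta: dict, case_meta: dict) -> dict:
    """Copy case-level fields into a document's metadata.

    Document-level values win if the same key appears in both.
    """
    merged = dict(case_meta)
    merged.update(doc_meta)
    return merged


def to_archive_headers(meta: dict) -> list:
    """Render metadata as x-archive-meta-* header lines, sorted by key."""
    return [f"x-archive-meta-{key}: {value}" for key, value in sorted(meta.items())]


# Hypothetical case-level fields; the placeholder values are not real data.
case_meta = {
    "court": "vid",
    "pacer-case-num": "10330",
    "docket-num": "PLACEHOLDER-DOCKET-NUMBER",
    "case-name": "PLACEHOLDER CASE NAME",
    "judge": "PLACEHOLDER JUDGE",
}

# Document-level fields taken from the listing quoted in this comment.
doc_meta = {"court": "vid", "pacer-case-num": "10330", "doc-num": "370"}

merged = merge_case_into_document(doc_meta, case_meta)
headers = to_archive_headers(merged)
```

Each document would then carry the docket number, case name, and judge alongside its own fields, so it could be found without consulting the separate docket file.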

    You also need more comprehensive descriptors. For example, the metadata should include a line stating “United States District Court”, although one could claim this is implied by the cryptic “vid” in the file name.

    You could also help out by counting the number of pages and characters in the document using a standard PDF SDK.

    Anyway, whatever you do, please do not toss out the information in the docket sheet header.

    Alan

  6. Harlan Yu says:

    Hi Alan, thanks for your comments. I agree that these are all important issues. This release is just the first iteration of the project and we look forward to working with you and others to make these documents as useful as possible. A few responses:

    – We’ve open-sourced the client and would love to see outside developers add features and submit patches. Implementing better filenames is definitely one of these client-side features that somebody could run with (though it might not be as easy as it sounds; it will probably need a client-side cache of case names, etc.)

    – We’re slowly improving our scraping to gather more metadata from the docket sheet header. As you may have noticed, each instance of CM/ECF can choose to style their HTML pages differently, so a bit of logic and lots of testing is needed to make sure we’re scraping correctly for each court.

    – There’s now a centralized feedback forum for RECAP (http://recapthelaw.uservoice.com). It would be great if you could enter all of your specific suggestions there!

    If you have more technical questions, I’m happy to continue the discussion off-forum.

    Harlan.