chongdelapaz652qw6

[ HUMOR] Dead With RegEx



There have been over 1,100 links to this question/answer since 2008 (that's nearly 2% of all questions tagged with regex - and that includes questions before this answer was posted). There have been nearly 150 this year alone.


In the cases I've seen, the OP was relatively new or clueless about regexes. This answer does not help them - it only contains a very weak suggestion at the end that a parser should be used (without ever actually explaining why regexes won't work - a very ineffective argument for the determined "n00b"). In my opinion, to many new users, it could easily come off as rude, caustic or snooty. I don't think that's the reputation we want for StackOverflow.







To be clear: parsing HTML with the help of regex, while possible in some regex flavors, should be highly discouraged. The cases where regex is an easier/cheaper tool than a true HTML parser are few and far between. That is not the issue... the issue is that this particular answer, at the very, very least, according to our own standards, is not a good answer.


I spent some time going through the list of answers to that question and found that most of them don't offer a way out. In my opinion, a good explanation of why regex should not be used to parse HTML, with a library recommendation at the end, would make a good reference to link to whenever someone tries to parse HTML with regex.


David's answer basically just consists of a link to the REX paper by Robert D. Cameron. It's actually a pretty nice paper, and it does contain code for a usable and reasonably compact regexp-based XML parser in the appendices. Alas, REX is an XML parser, not an HTML parser, and it cannot cope with various features (like unquoted attribute values) that are legal in HTML but not in XML.
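The difference between XML and HTML strictness mentioned above can be seen with a small sketch. It does not use REX itself; as a stand-in, it assumes Python's standard `xml.etree.ElementTree` for the XML side and `html.parser` for the HTML side. The `AttrCollector` class and the sample snippet are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class AttrCollector(HTMLParser):
    """Hypothetical helper: records every attribute seen on a start tag."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


# An unquoted attribute value: legal in HTML, not well-formed XML.
snippet = "<a href=page.html>link</a>"

collector = AttrCollector()
collector.feed(snippet)
collector.close()

# A strict XML parser rejects the same input.
try:
    ET.fromstring(snippet)
    xml_accepts = True
except ET.ParseError:
    xml_accepts = False

print(collector.attrs, xml_accepts)
```

Any tool built on XML's rules, regex-based or not, will probably trip over the same kind of input.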


Some of the historical context is missing for that question. There are certain restricted subsets of HTML and use cases where regex will do. This always made it difficult to work out when you needed what. You can also parse HTML with algorithms involving multiple chunks of regex. While you're using regex, you're also filling in the gaps with the logic and semantics of your programming language, such as loops, recursion, stacks, and variables.
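As a rough sketch of that "regex plus program logic" approach, the example below uses a regex only to find tags and leaves the nesting to an ordinary stack. The `TAG` pattern and the `is_balanced` function are hypothetical and deliberately simplified: they ignore comments, scripts, void elements such as `<br>` without a slash, and attribute values containing `>`, so this is only suitable for a restricted subset of markup.

```python
import re

# Hypothetical, simplified tag pattern: optional leading slash, tag name,
# anything up to the closing '>', and an optional trailing slash.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>")


def is_balanced(markup):
    """Return True if every opening tag is closed in the right order."""
    stack = []
    for match in TAG.finditer(markup):
        closing, name, self_closing = match.groups()
        name = name.lower()
        if self_closing:
            continue  # <br/> style tags open and close at once
        if closing:
            # A closing tag must match the most recently opened tag.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack  # leftover entries are unclosed tags


print(is_balanced("<div><p>hi</p></div>"))
print(is_balanced("<div><p>hi</div></p>"))
```

The regex does the tokenizing; the stack, which a single regular expression cannot express in most flavors, does the actual structural work.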


Such libraries haven't always had obvious interfaces either, ones that make it as simple as it would seem with regex to take a string and, for example, extract a list of specific tags. Although this is also improving, it's something that's often overlooked.


Politicians are a common target of vandalism on Wikipedia. The article on Donald Trump was replaced with a single sentence critical of him in July 2015,[24][25][26] and in November 2018, the lead picture on the page was replaced with an image of a penis, causing Apple's virtual assistant Siri to briefly include this image in answers to queries about the subject.[27] Both Hillary and Bill Clinton's Wikipedia pages were vandalized in October 2016 by a member of the Internet trolling group Gay Nigger Association of America adding pornographic images to their articles.[28] That same month, New York Assembly candidate Jim Tedisco's Wikipedia page was modified to say that he had "never been part of the majority", and "is considered by many to be a total failure". Tedisco expressed dismay at the changes to his page.[29] On 24 July 2018, Utah senator Orrin Hatch posted humorous tweets after Google claimed that he had died on 11 September 2017,[30] with the error being traced back to an edit to his Wikipedia article.[31][32][33] Similarly, vandalism of the California Republican Party's Wikipedia page caused Google's information bar to list Nazism as one of the party's primary ideologies.[34]


The problem in the comic is not with regexes per se but with situations when the entered text or expression passes through several interpreters, like bash -> grep/sed/awk, or program text -> external shell command. In such cases, you have to escape backslashes for each program in the sequence, and it gets worse if you have 'real' backslashes in the final text that you're processing with the utilities (Windows' file paths, for example). See _toothpick_syndrome. Feel free to lift this to the explanation page, since I'm not good at longer and more careful explanations than this one. Also, gotta notice that Feedly stripped paired backslashes in the title text (probably passed it through some 'interpreter' embedded in its scripts). Aasasd (talk) 10:13, 3 February 2016 (UTC)
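The layering described above can be reproduced with just two interpreters: Python's string-literal parser and its regex engine. This is a minimal sketch; the sample path and variable names are made up for illustration.

```python
import re

# The data contains exactly one real backslash (a Windows-style path).
text = r"C:\temp"

# Layer 1: the regex engine needs \\ to mean one literal backslash.
# Layer 2: a normal Python string literal needs each of those doubled again,
# so the source code ends up with four backslashes.
pattern_plain = "C:\\\\temp"

# A raw string literal removes layer 2, leaving only the regex escaping.
pattern_raw = r"C:\\temp"

# re.escape() adds the regex-level escaping for us.
pattern_escaped = re.escape(text)

print(pattern_plain == pattern_raw)
print(re.search(pattern_raw, text) is not None)
print(re.search(pattern_escaped, text) is not None)
```

Each extra interpreter in the chain (a shell, a config file format, another program's argument parser) would double the count again, which is how a regex ends up with eight backslashes in a row.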


Attempting to add to the discussion: This regex is not necessarily invalid or incomprehensible. (Note: The regex changed after initial publication. See Changed Regex below.) It looks like he was looking for a line with a regular expression or definitely some code. You just have to work your way through the backslashes. Although it might be invalid depending on the precise rules. He has some unescaped closing brackets and closing parentheses. If these always have to be escaped, then the regex is invalid. If, however, you don't have to escape a closing bracket with no opening bracket, then things are fine. I'm not familiar enough with grep's regex parser to know how it handles that edge case. Presuming those unescaped parens and brackets are fine, his regex searches for:


I suspect that Randall may have used the regexp in the title text to *find* malformed regular expressions in a file (out.txt) that he (or someone) had previously filled with output from some error message (or collection of error messages, or at least the output of something where a regular expression had been expected to work but had not worked as expected). 162.158.252.227 19:06, 3 February 2016 (UTC)


Your analysis is thorough and correct, however it is unlikely this is what the regex was intended to accomplish. (Note: The regex changed after initial publication. See Changed Regex below.) More likely, Randall is more accustomed to other regex dialects such as Perl(-compatible) regex where a backslash does work to escape special characters inside a character class. Under that assumption the regex (with some whitespace inserted for readability) would break up as:


Although the final condition is still a bit obscure, this still makes a lot more sense. Unfortunately it also crushes Randall's hope the regex worked as intended, since this simply isn't how the expression is parsed with grep's default syntax (which is why I always use grep -P). --141.101.75.185 15:34, 4 February 2016 (UTC)


Funny enough, I'm literally looking at some other dev's code right now that actually implements an eight backslash regex sequence, with just the comment "backslash". I'm still scratching my head over what they were trying to accomplish or even communicate with this. Domino (talk) 21:45, 16 August 2016 (UTC)domino


Here's the game site: Remember to log in with your GitHub account. Once you've worked through the Intermediate level, take a screenshot and upload it to your ctl GitHub repo with the name "regex_crossword". The file extension doesn't matter, but .png, .jpg, or .pdf are preferred. Check to make sure that you can see your screenshot at the appropriate URL. For example:


Do not limit yourself to the resources provided when working on your crossword (or anything for that matter). Part of what we're trying to do in this class is to acquaint you with nomenclature so that you can efficiently seek out your own solutions. See the Tech Support Cheatsheet flowchart. You know you're looking to understand regex. You know how to Google; embrace this. I've budgeted the time for this on the assumption that you will have to do some research while playing. That being said, start with what's at hand. The crossword's Help feature is really helpful. ;)


When using the verbose flag, you can no longer use ' ' to match a space character; use \s instead. The regexp_tokenize() function has an optional gaps parameter. When set to True, the regular expression specifies the gaps between tokens, as with re.split().
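Both points can be checked with the standard `re` module alone. The sketch below does not call `regexp_tokenize()` itself; it uses `re.findall()` and `re.split()` to show the two ways of reading a pattern (as the tokens, or as the gaps between them). The sample strings are made up for illustration.

```python
import re

# In verbose mode, unescaped whitespace and # comments in the pattern
# are ignored, so a literal space has to be written as \s (or escaped).
two_words = re.compile(r"""
    \w+     # a word
    \s      # one whitespace character; a bare ' ' here would be ignored
    \w+     # another word
""", re.VERBOSE)

# Demonstration that a bare space is dropped: this pattern is just "ab".
space_ignored = re.compile(r"a b", re.VERBOSE)

sentence = "to be or not"

# Pattern describes the tokens themselves.
tokens_by_match = re.findall(r"\w+", sentence)

# Pattern describes the gaps between tokens, as with re.split().
tokens_by_gap = re.split(r"\s+", sentence)

print(tokens_by_match, tokens_by_gap)
```

For simple whitespace-separated text the two readings give the same tokens; they diverge once punctuation is involved, since a gap pattern keeps whatever it does not explicitly split on.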


When a computer is expected to perform a long task, it's not considered polite to 'lock it up', i.e. make it difficult or impossible for the user to interact with it until the task completes. There are exceptions to this; however, in general it's a good policy and tends to lead to less annoyed users. Many developers, when confronting this need, turn to threads, but often threads introduce far more problems than they solve. If you've ever tracked down race conditions or deadlocks in threads, I'm sure you know exactly what I mean.
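One thread-free way to keep an application responsive is to break the long task into small steps and hand control back between them. The sketch below is only an illustration of that idea using a Python generator; it is not the code this article discusses, and the names `scrape_in_steps`, `run_cooperatively`, and `handle_ui_events` are hypothetical.

```python
def scrape_in_steps(urls):
    """Do one unit of work per iteration, then yield control to the caller."""
    for url in urls:
        # Placeholder for the real work on a single record.
        yield f"processed {url}"


def run_cooperatively(task, handle_ui_events):
    """Alternate between one step of the task and pending interface work."""
    results = []
    for result in task:
        results.append(result)
        handle_ui_events()  # the interface gets a turn between steps
    return results


# Usage: count how often the (stand-in) UI handler gets to run.
ui_turns = []
results = run_cooperatively(
    scrape_in_steps(["a", "b", "c"]),
    lambda: ui_turns.append("tick"),
)
print(results, len(ui_turns))
```

Because only one piece of code runs at a time, there is no shared state to lock, which avoids the race conditions and deadlocks that threads can bring. The trade-off is that each step must be short enough not to be noticed.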


Inspection of the code shows that the first test determines if there is anything at all to do, i.e. if the WebScraper is stopped or paused, it simply returns without comment. For those with a sense of humor, it's conditions like this where you can have the system maintain some basic count and complain at the user to 'just do something, anything!' if this count exceeds a certain preset. In more practical applications, especially when security is an issue, this is an excellent means for automatically terminating unused applications that shouldn't be left open.


There is a method provided for each class of data we wish to retrieve that allows us to know where to retrieve it from. Since any attempt at retrieval will produce some kind of response from the server, if nothing more than a 401 error, we pair the attempt to retrieve the data with a simple test to determine if the data we want wasn't successfully retrieved. Note that distinction carefully: I'm not saying we have a test to know we got what we want; we instead have a test to know we didn't get what we want. The only true way to know you got what you want is to inspect the result in detail, since there is nothing that says the server couldn't have simply dropped dead halfway through the process of returning legitimate data to you. On the other hand, there is probably a pretty easily established criterion that tells you that you absolutely didn't get what you want. In this specific case, if you examine the condition for entering the second switch statement in processRecord, you can see it's wrapped in a conditional that tests whether the size of the returned page is at least 1500 bytes. If not, it's impossible for this page to contain enough data to satisfy us, and we can safely assume it doesn't contain what we want. On the other hand, if it does contain at least 1500 bytes, that doesn't necessarily mean that it's got all the data we want; it only means it's worth spending further effort to see if the data we want is there.
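The "test for failure, not for success" idea can be sketched in a few lines. This is not the `processRecord` code the text refers to; the function names and the `parse` callback are hypothetical, and only the 1500-byte threshold comes from the description above.

```python
MIN_PAGE_BYTES = 1500  # threshold taken from the text


def definitely_incomplete(page: bytes) -> bool:
    """True when the page is too small to possibly hold the data we want.

    A False result does NOT mean the data is present, only that it is
    worth inspecting the page further.
    """
    return len(page) < MIN_PAGE_BYTES


def process_page(page: bytes, parse):
    """Run the (hypothetical) detailed parser only on plausible pages."""
    if definitely_incomplete(page):
        return None  # cheap rejection, no parsing effort spent
    return parse(page)  # the detailed inspection still has to decide


# Usage with a stand-in parser that just reports the page size.
print(process_page(b"x" * 100, len))
print(process_page(b"x" * 2000, len))
```

The cheap size check filters out obvious failures such as error pages, while the detailed inspection remains the only way to confirm success.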

